Python: parse html and produce a tabular text file


The problem: I want to parse an HTML page and produce a tabular text file like this:
East Counties
Babergh, http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml, 876
Basildon, http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml, 1134
...
...
What I get instead:
Only East Counties appears in the text file, so the for loop fails to write each new region. My attempt follows the HTML excerpt.
HTML code:
The full markup is on the page my script fetches; this is the excerpt corresponding to the table above:
<h2>
East Counties</h2>
<table>
<thead>
<tr>
<th>
<span id="listRegions_lvFiles_0_titleLAName_0">Local authority</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleUpdate_0">Last update</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleEstablishments_0">Number of businesses</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleCulture_0">Download</span>
</th>
</tr>
</thead>
<tr>
<td>
<span id="listRegions_lvFiles_0_laNameLabel_0">Babergh</span>
</td>
<td>
<span id="listRegions_lvFiles_0_updatedLabel_0">04/05/2017 </span>
at
<span id="listRegions_lvFiles_0_updatedTime_0"> 12:00</span>
</td>
<td>
<span id="listRegions_lvFiles_0_establishmentsLabel_0">876</span>
</td>
<td>
<a id="listRegions_lvFiles_0_fileURLLabel_0" title="Babergh: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a>
</td>
</tr>
<tr>
<td>
<span id="listRegions_lvFiles_0_laNameLabel_1">Basildon</span>
</td>
<td>
<span id="listRegions_lvFiles_0_updatedLabel_1">06/05/2017 </span>
at
<span id="listRegions_lvFiles_0_updatedTime_1"> 12:00</span>
</td>
<td>
<span id="listRegions_lvFiles_0_establishmentsLabel_1">1,134</span>
</td>
<td>
<a id="listRegions_lvFiles_0_fileURLLabel_1" title="Basildon: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a>
</td>
</tr>
My attempt:
from xml.dom import minidom
import urllib2
from bs4 import BeautifulSoup
url='http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
mainpage = f.read()
soup = BeautifulSoup(mainpage, 'html.parser')
regions=[]
with open('Regions_and_files.txt', 'w') as f:
    for h2 in soup.find_all('h2')[6:]: #Skip 6 h2 lines
        region=h2.text.strip() #Get the text of each h2 without the white spaces
        regions.append(str(region))
        f.write(region+'\n')
        for tr in soup.find_all('tr')[1:]: # Skip headers
            tds = tr.find_all('td')
            if len(tds)==0:
                continue
            else:
                a = tr.find_all('a')
                link = str(a)[10:67]
                span = tr.find_all('span')
                places = int(str(span[3].text).replace(',', ''))
                f.write("%s,%s,%s" % \
                        (str(tds[0].text)[1:-1], link, places)+'\n')
How can I fix this?
I'm not familiar with the Beautiful Soup library, but judging from the code, on each iteration of the h2 loop you traverse every tr element in the document. You should instead traverse only the rows that belong to the table associated with that specific h2 element.
Edited:
After a quick look at the Beautiful Soup docs, it looks like you can use .next_sibling, since each h2 is always followed by its table: table = h2.next_sibling.next_sibling (called twice because the first sibling is a string containing whitespace). From the table you can then traverse just its rows.
The reason you are getting duplicates for Wales is because there actually are duplicates in the source.
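The per-heading traversal described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested fix for the live page: it is written for Python 3, it runs on an inline sample shaped like the excerpt in the question rather than the fetched page, the second region in the sample is made up, and the names rows_by_region and format_lines are hypothetical. It also uses find_next_sibling(True) in place of calling .next_sibling twice, which skips the whitespace string and returns the next tag; headings not immediately followed by a table are ignored, so no [6:] slice is needed here.

```python
from bs4 import BeautifulSoup

# Sample markup shaped like the excerpt in the question. The second region,
# its authority and its example.com URL are made up for illustration.
SAMPLE_HTML = """
<h2>
East Counties</h2>
<table>
<thead>
<tr><th>Local authority</th><th>Last update</th><th>Number of businesses</th><th>Download</th></tr>
</thead>
<tr>
<td><span>Babergh</span></td>
<td><span>04/05/2017 </span> at <span> 12:00</span></td>
<td><span>876</span></td>
<td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a></td>
</tr>
<tr>
<td><span>Basildon</span></td>
<td><span>06/05/2017 </span> at <span> 12:00</span></td>
<td><span>1,134</span></td>
<td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a></td>
</tr>
</table>
<h2>
Example Region</h2>
<table>
<tr>
<td><span>Exampleton</span></td>
<td><span>01/01/2017 </span> at <span> 12:00</span></td>
<td><span>42</span></td>
<td><a href="http://example.com/placeholder.xml">English language</a></td>
</tr>
</table>
"""


def rows_by_region(html):
    """Return [(region, [(authority, url, count), ...]), ...]."""
    soup = BeautifulSoup(html, "html.parser")
    regions = []
    for h2 in soup.find_all("h2"):
        # Next tag after the heading, skipping whitespace-only strings.
        table = h2.find_next_sibling(True)
        if table is None or table.name != "table":
            continue  # this heading has no table of its own
        rows = []
        # Only the rows of this heading's table, not every tr in the page.
        for tr in table.find_all("tr"):
            tds = tr.find_all("td")
            link = tr.find("a", href=True)
            if len(tds) < 3 or link is None:
                continue  # header row or unexpected layout
            name = tds[0].get_text(strip=True)
            count = int(tds[2].get_text(strip=True).replace(",", ""))
            rows.append((name, link["href"], count))
        regions.append((h2.get_text(strip=True), rows))
    return regions


def format_lines(regions):
    """Flatten the regions into the text layout asked for in the question."""
    lines = []
    for region, rows in regions:
        lines.append(region)
        lines.extend("%s, %s, %d" % row for row in rows)
    return lines


for line in format_lines(rows_by_region(SAMPLE_HTML)):
    print(line)
```

Reading the href attribute directly avoids slicing str(a) at fixed offsets, which only works while every link happens to have the same length. For the real page, the fetched HTML would be passed in place of SAMPLE_HTML and the lines written to the output file.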
