BeautifulSoup find_all("img") not working for all sites


I'm trying to write a Python script to download images from any website. It works, but inconsistently. Specifically, find_all("img") finds the images on the first URL below but returns nothing useful for the second. The script is:
# works for http://proof.nationalgeographic.com/2016/02/02/photo-of-the-day-best-of-january-3/
# but not http://www.nationalgeographic.com/photography/proof/2017/05/lake-chad-desertification/
import requests
from PIL import Image
from io import BytesIO
from bs4 import BeautifulSoup

def url_to_image(url, filename):
    # get HTTP response, open as bytes, save the image
    # http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
    req = requests.get(url)
    i = Image.open(BytesIO(req.content))
    i.save(filename)

# open page, get HTML request and parse with BeautifulSoup
html = requests.get("http://proof.nationalgeographic.com/2016/02/02/photo-of-the-day-best-of-january-3/")
soup = BeautifulSoup(html.text, "html.parser")

# find all JPEGS in our soup and write their "src" attribute to array
urls = []
for img in soup.find_all("img"):
    if img["src"].endswith("jpg"):
        print("endswith jpg")
        urls.append(str(img["src"]))
        print(str(img))

jpeg_no = 0
for url in urls:
    url_to_image(url, filename="NatGeoPix/" + str(jpeg_no) + ".jpg")
    jpeg_no += 1
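On the page that fails, the images are rendered with JavaScript, so the HTML returned by requests.get does not contain the img tags you are looking for.
Render the page first, for example with dryscrape, and then parse the rendered HTML with BeautifulSoup.
(If you don't want to use dryscrape, see "Web-scraping JavaScript page with Python".)
For example: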
import requests
from PIL import Image
from io import BytesIO
from bs4 import BeautifulSoup
import dryscrape

def url_to_image(url, filename):
    # get HTTP response, open as bytes, save the image
    # http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
    req = requests.get(url)
    i = Image.open(BytesIO(req.content))
    i.save(filename)

# render the page with dryscrape, then parse the rendered HTML with BeautifulSoup
session = dryscrape.Session()
session.visit("http://www.nationalgeographic.com/photography/proof/2017/05/lake-chad-desertification/")
response = session.body()
soup = BeautifulSoup(response, "html.parser")

# find all JPEGS in our soup and write their "src" attribute to array
urls = []
for img in soup.find_all("img"):
    if img["src"].endswith("jpg"):
        print("endswith jpg")
        urls.append(str(img["src"]))
        print(str(img))

jpeg_no = 0
for url in urls:
    url_to_image(url, filename="NatGeoPix/" + str(jpeg_no) + ".jpg")
    jpeg_no += 1
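To confirm that JavaScript rendering is the cause, it can help to check whether the static HTML contains any img tags at all. The following is a minimal sketch of that check. It uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs, and the two sample pages, the ImgSrcCollector class and the static_img_sources helper are made up for illustration; they are not the real National Geographic markup.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag found in static HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags that have no src attribute
                self.sources.append(src)


def static_img_sources(html):
    """Return the src values of all <img> tags present in the raw HTML."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Hypothetical sample pages (assumptions, for illustration only).
static_page = '<html><body><img src="/a.jpg"><img alt="no src"></body></html>'
js_page = '<html><body><div id="app"></div><script>render()</script></body></html>'

print(static_img_sources(static_page))  # images are present in the raw HTML
print(static_img_sources(js_page))      # nothing to find until a browser runs the script
```

If the same check on the text returned by requests.get comes back empty for a page that shows images in a browser, the page most likely needs to be rendered before parsing.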
I would also make sure each image URL is absolute rather than relative before requesting it:
import requests
from PIL import Image
from io import BytesIO
from bs4 import BeautifulSoup
import dryscrape
from urllib.parse import urljoin

def url_to_image(url, filename):
    # get HTTP response, open as bytes, save the image
    # http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
    req = requests.get(url)
    i = Image.open(BytesIO(req.content))
    i.save(filename)

# render the page with dryscrape, then parse the rendered HTML with BeautifulSoup
base = "http://www.nationalgeographic.com/photography/proof/2017/05/lake-chad-desertification/"
session = dryscrape.Session()
session.visit(base)
response = session.body()
soup = BeautifulSoup(response, "html.parser")

# find all JPEGS in our soup and write their "src" attribute to array
urls = []
for img in soup.find_all("img"):
    if img["src"].endswith("jpg"):
        print("endswith jpg")
        urls.append(str(img["src"]))
        print(str(img))

jpeg_no = 0
for url in urls:
    # resolve relative URLs against the page URL; leave absolute ones alone
    if url.startswith("http"):
        absolute = url
    else:
        absolute = urljoin(base, url)
    print(absolute)
    url_to_image(absolute, filename="NatGeoPix/" + str(jpeg_no) + ".jpg")
    jpeg_no += 1
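As a side note on the absolute-URL step: urljoin already returns an absolute URL unchanged, so the startswith check is optional. The sketch below shows how it resolves a few kinds of src value against the page URL. The sample src values and the absolutize helper are made up for illustration.

```python
from urllib.parse import urljoin

base = "http://www.nationalgeographic.com/photography/proof/2017/05/lake-chad-desertification/"


def absolutize(base_url, src):
    """Resolve src against base_url; absolute URLs come back unchanged."""
    return urljoin(base_url, src)


# Hypothetical src values of the kinds commonly seen in img tags.
samples = [
    "http://example.com/pic.jpg",   # already absolute
    "/images/pic.jpg",              # relative to the site root
    "pic.jpg",                      # relative to the current page
    "//cdn.example.com/pic.jpg",    # protocol-relative
]

for src in samples:
    print(src, "->", absolutize(base, src))
```

The protocol-relative case is the one the startswith("http") test alone would not recognise as needing a scheme, which is a reason to pass every src through urljoin.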
