Revisions of Web Scraping.md

Have you ever wanted to get specific data from a website that doesn't offer an API for it?
That's where web scraping comes in: if the website doesn't make the data available, we can scrape it from the pages themselves.

But before we dive in, let us first define what web scraping is. According to [Wikipedia](http://en.wikipedia.org/wiki/Web_scraping):

{% blockquote %}
Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox.
{% endblockquote %}

So yes, web scraping lets us extract information from websites.
But there are some legal issues around web scraping.
Some consider it an act of trespassing on the website you are scraping the data from.
That's why it is wise to read the terms of service of the website you want to scrape: you might be doing something illegal without knowing it.
You can read more about it on this [Wikipedia page](http://en.wikipedia.org/wiki/Web_scraping).

##Web Scraping Techniques

There are many web scraping techniques, as mentioned on the Wikipedia page linked earlier.
I will only discuss the following:

- Document Parsing
- Regular Expressions

###Document Parsing

Document parsing is the process of converting HTML into a DOM (Document Object Model) that we can traverse.
Here's an example of how we can scrape data from a public website:

```php
<?php
$html = file_get_contents('http://pokemondb.net/evolution'); //get the html returned from the following url

$pokemon_doc = new DOMDocument();

libxml_use_internal_errors(TRUE); //buffer libxml errors instead of printing them

if(!empty($html)){ //if any html is actually returned

    $pokemon_doc->loadHTML($html);
    libxml_clear_errors(); //remove errors for yucky html

    $pokemon_xpath = new DOMXPath($pokemon_doc);

    //get all the h2's with an id
    $pokemon_row = $pokemon_xpath->query('//h2[@id]');

    if($pokemon_row->length > 0){
        foreach($pokemon_row as $row){
            echo $row->nodeValue . "<br/>";
        }
    }
}
?>
```

The first thing the code above does is get the HTML returned from the URL of the website that we want to scrape.
In this case the website is [pokemondb.net](http://pokemondb.net).

```php
<?php
$html = file_get_contents('http://pokemondb.net/evolution');
?>
```

Then we create a new `DOMDocument`. It is used to convert the HTML string returned from `file_get_contents` into an actual Document Object Model that we can traverse:

```php
<?php
$pokemon_doc = new DOMDocument();
?>
```

Then we tell libxml to handle errors internally so that they aren't printed on the screen; instead they are buffered and stored:

```php
<?php
libxml_use_internal_errors(TRUE); //buffer libxml errors instead of printing them
?>
```

Next we check whether any HTML was actually returned:

```php
<?php
if(!empty($html)){ //if any html is actually returned
}
?>
```
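
The fetch and the emptiness check can be folded into one small helper. The sketch below is only an illustration: the name `fetch_html`, the 10-second timeout, and the user-agent string are assumptions, not part of the original example. `file_get_contents` returns `false` on failure, so the helper turns both a failed request and an empty body into `null`.

```php
<?php
// Hypothetical helper: fetch a URL (or local path) and return its body,
// or null when the request fails or nothing comes back.
function fetch_html(string $url): ?string {
    // Timeout and user agent are illustrative values, not requirements.
    $context = stream_context_create(array(
        'http' => array(
            'timeout' => 10,
            'user_agent' => 'example-scraper/0.1',
        ),
    ));

    // @ silences the warning that file_get_contents raises on failure;
    // the false return value is checked instead.
    $html = @file_get_contents($url, false, $context);

    return ($html === false || $html === '') ? null : $html;
}
```

With this in place the check becomes `if(($html = fetch_html('http://pokemondb.net/evolution')) !== null){ ... }`.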

Next we call the `loadHTML()` method on the `DOMDocument` instance that we created earlier, passing the HTML that was returned as its argument:

```php
<?php
$pokemon_doc->loadHTML($html);
?>
```

Then we clear the errors, if any. Most of the time yucky HTML causes these errors. Examples of yucky HTML are unclosed or mismatched tags, invalid attributes and invalid elements. Elements and attributes are considered invalid if they are not part of the HTML specification that libxml's parser was written against; that parser predates HTML5, so depending on the libxml version even newer tags such as `<nav>` can be reported.

```php
<?php
libxml_clear_errors(); //remove errors for yucky html
?>
```
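
To see what is actually sitting in the buffer before it is cleared, the errors can be read with `libxml_get_errors()`. This is a minimal sketch on a made-up HTML fragment; the helper name `load_html_collecting_errors` is an assumption for illustration, and the exact messages (or whether any are reported at all) depend on the libxml version installed.

```php
<?php
// Hypothetical helper: parse an HTML string and return the document
// together with the messages libxml buffered while parsing it.
function load_html_collecting_errors(string $html): array {
    $previous = libxml_use_internal_errors(true); // buffer instead of printing

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        // each $error is a LibXMLError object
        $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
    }

    libxml_clear_errors();                 // empty the buffer
    libxml_use_internal_errors($previous); // restore the earlier setting

    return array($doc, $messages);
}

// Made-up fragment with an unclosed <p> and a stray closing tag.
list($doc, $messages) = load_html_collecting_errors('<div><p>Bulbasaur</div></span>');
foreach ($messages as $message) {
    echo $message . "\n";
}
```

Logging the messages once while developing a scraper is a cheap way to tell harmless markup noise from a page that failed to parse the way you expected.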

Next we create a new instance of `DOMXPath`. This allows us to run queries against the DOM Document that we created.
It takes the `DOMDocument` instance as its argument.

```php
<?php
$pokemon_xpath = new DOMXPath($pokemon_doc);
?>
```

Finally, we write the query for the specific elements that we want to get. If you have used [jQuery](http://jquery.com/) before, this is similar to what you do when you select elements from the DOM.
What we're selecting here is every `h2` tag that has an `id`. The double slash `//` right before the element name makes the location of the `h2` unspecific, so it is matched anywhere in the document. The value of the `id` doesn't matter either; as long as there's an `id`, the element gets selected. The `nodeValue` property contains the text inside each `h2` that was selected.

```php
<?php
//get all the h2's with an id
$pokemon_row = $pokemon_xpath->query('//h2[@id]');

if($pokemon_row->length > 0){
    foreach($pokemon_row as $row){
        echo $row->nodeValue . "<br/>";
    }
}
?>
```
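
To make the effect of `//` and `[@id]` concrete, here is a small sketch that runs a few XPath expressions against an inline HTML string instead of the live site. The markup, the id values `gen1` and `gen2`, and the helper name `xpath_texts` are invented for the example.

```php
<?php
// Invented markup: two h2 elements with an id, one without.
$sample = '<html><body>'
    . '<h2 id="gen1">Generation 1</h2>'
    . '<div><h2 id="gen2">Generation 2</h2></div>'
    . '<h2>No id here</h2>'
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// Hypothetical helper: collect the text of every node matched by an expression.
function xpath_texts(DOMXPath $xpath, string $expression): array {
    $texts = array();
    foreach ($xpath->query($expression) as $node) {
        $texts[] = $node->nodeValue;
    }
    return $texts;
}

$any_h2       = xpath_texts($xpath, '//h2');             // every h2, at any depth
$h2_with_id   = xpath_texts($xpath, '//h2[@id]');        // only h2s that have an id
$h2_gen2      = xpath_texts($xpath, '//h2[@id="gen2"]'); // one specific id value
$direct_child = xpath_texts($xpath, '/html/body/h2');    // only h2s directly under body

print_r($h2_with_id);
```

The last expression shows what the double slash buys us: with a full path from the root, the `h2` nested inside the `div` is no longer matched.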

This results in the following text being printed on the screen:

```
Generation 1 - Red, Blue, Yellow
Generation 2 - Gold, Silver, Crystal
Generation 3 - Ruby, Sapphire, Emerald
Generation 4 - Diamond, Pearl, Platinum
Generation 5 - Black, White, Black 2, White 2
```

Let's do one more example with document parsing before we move on to regular expressions.
This time we're going to get a list of all the pokemon along with their types (e.g. Fire, Grass, Water).

First let's examine what we have on pokemondb.net/evolution so that we know which element to query.

![checking](/images/posts/getting_started_with_web_scraping/check.png)

As you can see from the screenshot, the information that we want is contained within a `span` element with a class of `infocard-tall `. Yes, the trailing space is included. In XPath, `@class="..."` is compared against the entire attribute value, so any spaces present in the markup must appear in the query too; otherwise nothing matches.

Converting what we know into an actual query, we come up with this:

```
//span[@class="infocard-tall "]
```

This selects all the `span` elements that have a class of `infocard-tall `. It doesn't matter where in the document the span is, because we used the double forward slash before the element name.
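
Because `@class="..."` compares the whole attribute string, a query like this stops matching as soon as the site adds a second class or drops the trailing space. A common XPath 1.0 workaround is to pad the normalized class list with spaces and search for the padded name. The sketch below uses invented markup to show the difference; the extra class names `featured` and `infocard-tall-wide` are assumptions for the example.

```php
<?php
// Invented markup: the same class written three different ways, plus a near miss.
$sample = '<div>'
    . '<span class="infocard-tall ">Bulbasaur</span>'
    . '<span class="infocard-tall">Ivysaur</span>'
    . '<span class="featured infocard-tall">Venusaur</span>'
    . '<span class="infocard-tall-wide">Not a match</span>'
    . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// Exact string comparison: only the span whose class is literally "infocard-tall ".
$exact = $xpath->query('//span[@class="infocard-tall "]');

// Token match: normalize whitespace, pad with spaces, then look for " infocard-tall ".
$token = $xpath->query(
    '//span[contains(concat(" ", normalize-space(@class), " "), " infocard-tall ")]'
);

echo $exact->length . "\n"; // number of exact matches
echo $token->length . "\n"; // number of token matches
```

The padding is what keeps `infocard-tall-wide` out of the results; a plain `contains(@class, "infocard-tall")` would match it too.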

Once we're inside the span, we have to get to the elements that directly contain the data that we want: the name and the type of the pokemon. As you can see from the screenshot below, the name of the pokemon is contained within an anchor (`a`) element with a class of `ent-name`, and the types are stored within a `small` element with a class of `aside`.

![info card](/images/posts/getting_started_with_web_scraping/info-card.png)

We can then use that knowledge to come up with the following code:

```php
<?php
$pokemon_list = array();

$pokemon_and_type = $pokemon_xpath->query('//span[@class="infocard-tall "]');

if($pokemon_and_type->length > 0){

    //loop through all the pokemon
    foreach($pokemon_and_type as $pat){

        //get the name of the pokemon
        $name = $pokemon_xpath->query('a[@class="ent-name"]', $pat)->item(0)->nodeValue;

        $pkmn_types = array(); //reset $pkmn_types for each pokemon
        $types = $pokemon_xpath->query('small[@class="aside"]/a', $pat);

        //loop through all the types and store them in the $pkmn_types array
        foreach($types as $type){
            $pkmn_types[] = $type->nodeValue; //the pokemon type
        }

        //store the data in the $pokemon_list array
        $pokemon_list[] = array('name' => $name, 'types' => $pkmn_types);

    }
}

//output what we have
echo "<pre>";
print_r($pokemon_list);
echo "</pre>";
?>
```

There's nothing new in the code above except for calling `query` inside the `foreach` loop.
We use the line below to get the name of the pokemon. You might notice that we passed a second argument to the `query` method. The second argument is the current row; it acts as the context node, which limits the scope of the query to that row.

```php
<?php
$name = $pokemon_xpath->query('a[@class="ent-name"]', $pat)->item(0)->nodeValue;
?>
```
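
The context node only limits relative expressions. An expression that starts with `//` is evaluated from the document root even when a context node is passed, which is an easy mistake to make; `.//` is the form that searches all descendants of the context node. A minimal sketch with invented markup (the class name `card` is an assumption for the example):

```php
<?php
// Invented markup: two cards, the second with its link nested inside <b>.
$sample = '<div>'
    . '<span class="card"><a class="ent-name">Bulbasaur</a></span>'
    . '<span class="card"><b><a class="ent-name">Ivysaur</a></b></span>'
    . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

$second_card = $xpath->query('//span[@class="card"]')->item(1);

// Relative path: direct children of the context node only.
$children = $xpath->query('a[@class="ent-name"]', $second_card);

// ".//": any descendant of the context node.
$descendants = $xpath->query('.//a[@class="ent-name"]', $second_card);

// "//": starts at the document root, so the context node is ignored.
$whole_document = $xpath->query('//a[@class="ent-name"]', $second_card);

// item(0) is null when nothing matched, so check before reading nodeValue.
$first = $descendants->item(0);
$name = $first !== null ? $first->nodeValue : null;
echo $name . "\n";
```

The null check matters in a real scraper: if the site changes its markup, `->item(0)->nodeValue` on an empty result fails instead of quietly skipping the row.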

The results would be something like this:

```
Array
(
    [0] => Array
        (
            [name] => Bulbasaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )
    [1] => Array
        (
            [name] => Ivysaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )
    [2] => Array
        (
            [name] => Venusaur
            [types] => Array
                (
                    [0] => Grass
                    [1] => Poison
                )
        )
    ...
)
```

###Regular Expressions
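
A minimal sketch of the regular-expression technique, assuming an invented HTML string rather than the live page: it pulls the text of every `h2` that carries an `id` using `preg_match_all`. Regular expressions know nothing about nesting or attribute semantics, so a pattern like this is more brittle than the `//h2[@id]` XPath query used earlier and is best kept for small, predictable fragments.

```php
<?php
// Invented markup for the example.
$sample = '<h2 id="gen1">Generation 1 - Red, Blue, Yellow</h2>'
    . '<h2>No id here</h2>'
    . '<h2 class="x" id="gen2">Generation 2 - Gold, Silver, Crystal</h2>';

// <h2 ... id="..." ...> followed by a lazy capture up to the closing tag.
// s: the dot also matches newlines, i: tag names are matched case-insensitively.
$pattern = '/<h2\b[^>]*\bid="[^"]*"[^>]*>(.*?)<\/h2>/si';

$headings = array();
if (preg_match_all($pattern, $sample, $matches) > 0) {
    // $matches[1] holds the first capture group of every match
    $headings = array_map('trim', $matches[1]);
}

print_r($headings);
```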

##Web Scraping Tools

###Simple HTML Dom

To make web scraping easier you can use a library such as Simple HTML DOM.
Here's an example of getting the names of the pokemon using Simple HTML DOM:

```php
<?php
require_once 'simple_html_dom.php'; //assumes the library file sits next to this script

$html = file_get_html('http://pokemondb.net/evolution');

foreach($html->find('a[class=ent-name]') as $element){
    echo $element->innertext . '<br>'; //outputs Bulbasaur, Ivysaur, etc...
}
?>
```

The syntax is simpler, so there is less code to write, and there are also some convenience functions and properties that you can use. One example is the `plaintext` property, which extracts all the text from a web page:

```php
<?php
echo file_get_html('http://pokemondb.net/evolution')->plaintext;
?>
```

###Ganon

##Scraping non-public parts of a website

###Scraping Amazon

- [Curl](http://curl.haxx.se/)
- [Simple HTML Dom](http://simplehtmldom.sourceforge.net/)
- [Ganon](https://code.google.com/p/ganon/)

##Resources

- [I don't need no stinking API: Web Scraping for fun and profit](http://blog.hartleybrody.com/web-scraping/)
- [Web scraping is actually pretty easy](http://blog.webspecies.co.uk/2011-07-27/web-scrapping-is-actually-pretty-easy.html)
- [Web scraping or API](https://news.ycombinator.com/item?id=4893922)

Erreur32 / Web Scraping.md

Erreur32 edited the gist 6 months ago.

Wern Ancheta edited the gist 13 years ago.
