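An Example of Scraping Web Table Data with R

I've recently taken a liking to R, a very powerful language for statistics. So here I want to try scraping a web page with R, using a developer's tweak download statistics on Bigboss as the test subject. (optimo-san, this is just a personal experiment!)

We need two packages to fetch and parse the page. This time, let's install them directly from the R console.

First, choose a CRAN mirror: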

chooseCRANmirror()

Then install the two packages we need:

install.packages("RCurl")
install.packages("XML")
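
If the script is run more than once, it may be nicer to install only what is missing. The following is a minimal sketch of that idea; `missing_packages` is a hypothetical helper name introduced here, not part of any package.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed yet.
# requireNamespace() returns FALSE (instead of raising an error) for a missing package.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("RCurl", "XML"))

# Only install from an interactive session, so a batch run never
# triggers an unexpected download.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```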

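With both packages installed, we can get down to business!

First load the dependencies. The first two are for fetching and parsing the page; the other four are for drawing the chart: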

require(RCurl)
require(XML)

require(datasets)
require(grDevices)
require(graphics)
library(showtext)

Next, fetch the HTML source of the Bigboss tweak download page:

htmlCode <- getURL("http://apt.thebigboss.org/stats.php?dev=DEVELOPER_NAME")
htmlCode <- readLines(tmp <- textConnection(htmlCode))
close(tmp)
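
Since the developer name ends up in the query string, a name containing spaces or other special characters would probably need escaping first. Here is a small sketch of building the URL that way; `stats_url` is a hypothetical helper name, and whether the site expects percent-encoded names is an assumption.

```r
# Hypothetical helper: build the stats URL for one developer name.
# utils::URLencode(reserved = TRUE) percent-encodes spaces and reserved characters.
stats_url <- function(developer) {
  paste0("http://apt.thebigboss.org/stats.php?dev=",
         utils::URLencode(developer, reserved = TRUE))
}

print(stats_url("DEVELOPER_NAME"))
```

The result can then be passed to getURL() in place of the hard-coded string.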

Then parse it as HTML:

HTMLDOM <- htmlTreeParse(htmlCode, error=function(...){}, useInternalNodes = TRUE)

Next, pull out the table data, using an XPath query:

download <- xpathSApply(HTMLDOM,"//table//td",xmlValue)
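
To see what this query returns without touching the network, the same call can be tried on a small inline document. The HTML below is made up for illustration; the markup of the real stats page may well differ.

```r
library(XML)

# Made-up HTML standing in for the stats page (an assumption, not the real markup).
sample_html <- paste0(
  "<html><body><table>",
  "<tr><td>Downloads for DEVELOPER_NAME</td></tr>",
  "<tr><td>TweakA</td><td>120</td><td>x</td><td>y</td></tr>",
  "</table></body></html>")

# asText = TRUE tells the parser that the argument is HTML content, not a file name.
doc <- htmlParse(sample_html, asText = TRUE)

# "//table//td" matches every <td> inside any <table>;
# xmlValue extracts the text of each matched cell.
cells <- xpathSApply(doc, "//table//td", xmlValue)
print(cells)
```

The result is a plain character vector with one element per cell, in document order, which is why the later steps can pick values out by position.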

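Note that the first element of the download variable is "Downloads for DEVELOPER_NAME".

The indexing below assumes that the cells after it come in groups of four per tweak, so the tweak names sit at positions 2, 6, 10, ... and the download counts at positions 3, 7, 11, .... We pick out the download counts, convert them to numbers, and store them in pie.sales; the tweak names are picked out in a similar way and assigned to names(pie.sales).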

pie.sales <- as.numeric(download[(seq(from = 3, to = length(download))) %% 4 == 1])
names(pie.sales) <- download[(seq(from = 1, to = length(download))) %% 4 == 2]
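
The modulo-based selection above can be a little hard to read. As a sketch, the same positions can be written out with seq(by = 4) and checked on a made-up vector. The layout (one title cell, then four cells per tweak, with the name first and the download count second) is inferred from the indexing, and the values in the third and fourth columns are invented placeholders.

```r
# Made-up cell values in the assumed layout: one title cell, then four cells per tweak.
download <- c("Downloads for DEVELOPER_NAME",
              "TweakA", "120", "1.0", "2015-01-01",
              "TweakB", "45",  "0.3", "2015-02-01")

# Names are at positions 2, 6, 10, ...; download counts at 3, 7, 11, ...
name_idx  <- seq(from = 2, to = length(download), by = 4)
count_idx <- seq(from = 3, to = length(download), by = 4)

pie.sales <- as.numeric(download[count_idx])
names(pie.sales) <- download[name_idx]
print(pie.sales)
```

Note that seq() with a positive by raises an error when from is greater than to, so a page with no table rows would need a length check first.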

Finally, call pie to draw the chart:

pie(pie.sales, col = rainbow(length(pie.sales)), edges = 400, radius = 1)

Here is the complete implementation:

Code

# Packages for fetching and parsing the page
require(RCurl)
require(XML)

# Packages for drawing
require(datasets)
require(grDevices)
require(graphics)
library(showtext)

# Developer name as listed on Bigboss
Developer <- "YOUR_DEVELOPER_NAME_ON_BIGBOSS"

# Get the HTML content and split it into lines
htmlCode <- getURL(paste("http://apt.thebigboss.org/stats.php?dev=", Developer, sep = ""))
htmlCode <- readLines(tmp <- textConnection(htmlCode))
close(tmp)

# Parse the HTML tree
HTMLDOM <- htmlTreeParse(htmlCode, error = function(...){}, useInternalNodes = TRUE)

# XPath query: the text of every table cell
download <- xpathSApply(HTMLDOM, "//table//td", xmlValue)

# Use the Kaiti font for drawing
# (older showtext releases named these font.add(), showtext.begin() and showtext.end())
font_add("Kaiti", "Kaiti.ttc")
plot.new()
showtext_begin()

# Set the download counts and tweak names
pie.sales <- as.numeric(download[(seq(from = 3, to = length(download))) %% 4 == 1])
names(pie.sales) <- download[(seq(from = 1, to = length(download))) %% 4 == 2]

# Draw!
pie(pie.sales, col = rainbow(length(pie.sales)), edges = 400, radius = 0.8)
title(main = Developer, cex.main = 1.4, font.main = 3, family = "Kaiti")
title(xlab = "Bigboss Downloads Count", cex.lab = 0.8, font.lab = 3, family = "Kaiti")

# End
showtext_end()
