Automated Google keyword search and URL collection on CentOS

QQ group: 397745473

Environment setup

A CentOS desktop environment is required; simply select the desktop option when installing the system.

Installing RDP on CentOS 7

Reference: https://www.cnblogs.com/lenmom/p/9516210.html

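# xrdp is packaged in EPEL, so enable EPEL before installing it
yum install -y epel-release
yum install -y xrdp tigervnc-server tmux
vncpasswd root
yum -y update && yum -y upgrade

# Lower the maximum color depth (max_bpp, bits per pixel) from 32 to 24; otherwise the remote connection may fail
sed -i 's/max_bpp=32/max_bpp=24/g' /etc/xrdp/xrdp.ini

# Disable SELinux (takes effect after a reboot)
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config

# Firewall: use ONE of the two options below
# Option 1: turn the firewall off entirely
systemctl stop firewalld
systemctl disable firewalld

# Option 2: keep firewalld running and open the RDP port (firewall-cmd needs firewalld running)
firewall-cmd --permanent --zone=public --add-port=3389/tcp
firewall-cmd --reload

systemctl start xrdp
systemctl enable xrdp
systemctl enable sshd

# tmux keeps long-running sessions alive across disconnects
tmux new -s vsyour
tmux ls
tmux a -d -t vsyour

# Split left/right: prefix + %
# Split top/bottom: prefix + "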
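
Once xrdp is running, it can help to confirm that the RDP port actually accepts connections before opening a remote desktop client. The following is a minimal sketch using only the Python standard library; the helper name port_is_open is hypothetical, and 3389 is assumed to be the xrdp port because that is the port opened in the firewall step above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

Calling port_is_open("127.0.0.1", 3389) on the server itself, or with the server's address from another machine, gives a quick yes/no answer. A False result from a remote machine but True locally usually points at the firewall rather than at xrdp.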

Installing software:

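# Anaconda
# Reference: https://www.anaconda.com/
wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh

# PyCharm
# Reference: https://www.cnblogs.com/niuli1987/p/9917650.html
wget https://download.jetbrains.com/python/pycharm-professional-2018.1.tar.gz
# Desktop shortcut (桌面 is the Desktop folder on a Chinese-locale system)
ln -s /root/pycharm-2018.1/bin/pycharm.sh /root/桌面/pycharm


# Xfce4 desktop environment
yum install epel-release
yum groupinstall xfce4

Run yum groupinstall xfce4 to install the Xfce4 desktop environment; other Xfce4 modules can be installed as needed. If yum does not recognize that group name, yum grouplist shows the exact name available from EPEL.
Run sudo systemctl isolate graphical.target to enter Xfce.


yum groupinstall "X Window system"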


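Troubleshooting

xrdp remote session on CentOS closes right after connecting

Anaconda conflicts with xrdp: after a reboot, the xrdp remote desktop session closes immediately after connecting. The fix is to comment out the block that Anaconda added to the shell startup file and replace it with a plain PATH export, as follows.

For reference: vim ~/.bashrc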

# .bashrc

# User specific aliases and functions

alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
. "/root/.acme.sh/acme.sh.env"

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
export PATH="$PATH:/root/anaconda3/bin"

#__conda_setup="$('/root/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
#if [ $? -eq 0 ]; then
# eval "$__conda_setup"
#else
# if [ -f "/root/anaconda3/etc/profile.d/conda.sh" ]; then
# . "/root/anaconda3/etc/profile.d/conda.sh"
# else
# #export PATH="/root/anaconda3/bin:$PATH"
# export PATH="$PATH:/root/anaconda3/bin"
# fi
#fi

References:
https://www.cnblogs.com/infoo/p/11239490.html
http://blog.sina.com.cn/s/blog_71bd750b010312q3.html
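
The manual edit above can also be sketched as a small text transformation. This is an illustration, not the author's method: the function name disable_conda_init is hypothetical, and it assumes the block is delimited by the two marker lines that conda init writes (the closing marker is not visible in the listing above, so check that it is present in your file). If the closing marker is missing, everything after the opening marker gets commented out.

```python
START = "# >>> conda initialize >>>"
END = "# <<< conda initialize <<<"  # assumed closing marker written by conda init


def disable_conda_init(bashrc_text: str) -> str:
    """Comment out every active line between the conda init markers."""
    out = []
    inside = False
    for line in bashrc_text.splitlines():
        stripped = line.strip()
        if stripped == START:
            inside = True
            out.append(line)
        elif stripped == END:
            inside = False
            out.append(line)
        elif inside and stripped and not stripped.startswith("#"):
            # Active line inside the managed block: turn it into a comment
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

The function only returns the new text; writing it back to ~/.bashrc (after making a backup) and adding the export PATH="$PATH:/root/anaconda3/bin" line shown above are left to the reader. Running it twice gives the same result, since lines that are already comments are left alone.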

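Another case where the desktop would not start

Without having changed anything, the desktop stopped loading again.

So another round of troubleshooting followed.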

Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1529419
abrt-auto-reporting enabled
1.Fresh install RHEL7.7alpha via mimi mode;
2.Install GUI via the command " yum groupinstall "Server with GUI" ",boot into GUI,it is not reproduce;
3.Update libX11(libX11-1.6.7-2.el7.x86_64) ,reboot and check the messages log,it still not reproduce.
After a reboot everything was back to normal.

Installing the browser driver

wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
tar -xvzf geckodriver-v0.23.0-linux64.tar.gz
chmod +x geckodriver
mv geckodriver /usr/local/bin/
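
Selenium looks for geckodriver on PATH, so a quick check after the install can save a confusing error later. The sketch below uses the standard library's shutil.which; the helper name missing_tools is hypothetical, and the tool names passed to it are whatever you expect to be installed.

```python
import shutil
from typing import Iterable, List


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be found as executables on PATH."""
    # shutil.which returns None when no matching executable is found
    return [name for name in names if shutil.which(name) is None]
```

For this setup, missing_tools(["geckodriver", "firefox"]) should return an empty list; any name it returns still needs to be installed or moved into a directory on PATH.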

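Installing an input method

Note: after installing it the desktop would no longer start. It took a long time to track down, and this step turned out to be the cause. It is best to skip this step!

Reference:

https://jingyan.baidu.com/article/cbf0e500b791142eaa28932f.html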

yum remove ibus
yum install ibus ibus-table
yum install ibus ibus-table-wubi
# This step caused problems: afterwards the desktop would not start at all

Automated search code

# coding: utf-8
# pip install selenium beautifulsoup4 lxml requests -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
import datetime
import json
import random
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys


class GetGoogleUrl(object):
    def __init__(self):
        self.printInfo('Fetching keywords....')
        self.keys = []
        # Collect the daily trending searches for the last 31 days
        for i in range(0, 31):
            pastTime = (datetime.datetime.now() - datetime.timedelta(days=i)).strftime('%Y%m%d')
            url = f'https://trends.google.com/trends/api/dailytrends?hl=en-US&tz=0&ed={pastTime}&geo=US&ns=15'
            try:
                response = requests.get(url, timeout=30)
                response.encoding = 'utf-8'
                # The body starts with the prefix )]}', which must be removed before parsing the JSON
                jdata = json.loads(response.text.strip().replace(')]}\',\n', ''))
                for x in jdata['default']['trendingSearchesDays'][0]['trendingSearches']:
                    self.keys.append(x['title']['query'])
            except Exception as e:
                self.printInfo(f'Failed to fetch keywords from {url}: {e}')

            self.printInfo(f'Used URL {url}, {len(self.keys)} keywords so far')
        self.keys = set(self.keys)
        self.printInfo(f'{len(self.keys)} keywords left after deduplication')

        firefox_options = webdriver.FirefoxOptions()
        # firefox_options.add_argument('--headless')  # run Firefox without a visible window
        # Private browsing; the Chrome-only --incognito and --disable-infobars flags do not apply to Firefox
        firefox_options.add_argument('-private')
        self.Firefox_driver = webdriver.Firefox(options=firefox_options)

    def printInfo(self, string):
        nowTime = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')  # current time
        printContent = f'[*] {nowTime} {string}'
        print(printContent)

    def getUrl(self):
        """Append the result links on the current page to urls.txt; return False when there is no next page."""
        try:
            soup = BeautifulSoup(self.Firefox_driver.page_source, 'lxml')
            if soup:
                # Result links sat inside <div class="r"> when this was written; Google's markup changes over time
                for i in soup.find_all('div', attrs={'class': 'r'}):
                    try:
                        writeUrl = i.a.get('href')
                    except Exception as e:
                        self.printInfo(f'{i} failed to get URL: {e}')
                        continue

                    if writeUrl:
                        with open('urls.txt', 'a') as f:
                            f.write(writeUrl + '\n')
                # The "next page" link has id="pnnext"; it is absent on the last page
                if soup.find('a', attrs={'id': 'pnnext'}):
                    return True
                else:
                    return False
        except Exception as e:
            self.printInfo(f'{self.Firefox_driver.current_url} failed to read page source: {e}')

        return True

    def seachKey(self, key):
        self.printInfo(f'Searching: {key}')
        down = "var q=document.documentElement.scrollTop=100000"
        try:
            self.Firefox_driver.find_element(By.NAME, "q").clear()
            self.Firefox_driver.find_element(By.NAME, "q").send_keys(key, Keys.RETURN)
            time.sleep(random.randint(1, 60))
            pageNumber = 0

            while True:
                pageNumber += 1
                # Collect the URLs on the current page
                if not self.getUrl():
                    time.sleep(random.randint(1, 60))
                    self.printInfo(f'Finished scraping {pageNumber} pages!')
                    break
                time.sleep(random.randint(1, 10))
                self.Firefox_driver.execute_script(down)  # scroll to the bottom
                # self.Firefox_driver.execute_script('window.scrollTo(0,1000000)')
                # self.Firefox_driver.execute_script(pnnext)  # next page
                next_link = self.Firefox_driver.find_element(By.ID, "pnnext")
                ActionChains(self.Firefox_driver).move_to_element(next_link).perform()  # simulate mouse movement
                time.sleep(random.randint(1, 3))
                self.Firefox_driver.find_element(By.ID, "pnnext").click()  # click through to the next page
                time.sleep(random.randint(1, 10))
        except Exception as e:
            self.printInfo(f'Search failed: {e}')
            # When the search box is a hidden input, searching is no longer possible, so stop
            tipMessage = 'Unable to clear element that cannot be edited: <input name="q" type="hidden">'
            if tipMessage in str(e):  # compare against the message text, not the exception object
                return 'exit'

    def firefoxDriver(self, url):
        time.sleep(random.randint(1, 3))  # brief pause; without it the driver sometimes errors out
        self.Firefox_driver.implicitly_wait(1)
        self.Firefox_driver.get(url)
        keyNumber = 0
        for key in self.keys:
            keyNumber += 1
            self.printInfo(f'[{str(keyNumber).zfill(3)}]'.center(100, '+'))

            if self.seachKey(key) == 'exit':
                return 'exit'


if __name__ == '__main__':
    for _ in range(2):
        getGoogleUrl = GetGoogleUrl()
        try:
            getGoogleUrl.firefoxDriver('https://www.google.com/ncr')
        finally:
            # Always close the browser, not only when the search returns 'exit'
            getGoogleUrl.Firefox_driver.quit()
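
The keyword-fetching step in the constructor can be tested without a browser or network access if the parsing is pulled out into its own function. The following is a sketch under assumptions: the function name parse_daily_trends is hypothetical, the JSON layout is the one the script above already relies on, and unlike the script (which reads only the first day of each response) it collects the queries from every day in the payload.

```python
import json
from typing import List

XSSI_PREFIX = ")]}',"  # leading guard string that precedes the JSON body


def parse_daily_trends(raw: str) -> List[str]:
    """Extract the trending search queries from a dailytrends response body."""
    text = raw.strip()
    if text.startswith(XSSI_PREFIX):
        # Drop the guard prefix; json.loads tolerates the newline that follows it
        text = text[len(XSSI_PREFIX):]
    data = json.loads(text)
    queries = []
    for day in data["default"]["trendingSearchesDays"]:
        for item in day["trendingSearches"]:
            queries.append(item["title"]["query"])
    return queries
```

Given a response body such as `)]}',` followed by `{"default": {"trendingSearchesDays": [...]}}`, the function returns the queries in the order they appear. Checking for the prefix with startswith, rather than replacing it anywhere in the text, avoids touching a query that happens to contain the same characters.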



Deduplication code

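# coding: utf-8
import glob


def fun_set(l):
    """Return the unique items of l as a list (order is not preserved)."""
    return list(set(l))


# Read every URL from all urls*.txt files
allDomain = []
for fileName in glob.glob('urls*.txt'):
    with open(fileName, 'r') as f:
        allUrl = f.read().split('\n')
    print(f'{fileName}: read {len(allUrl)} URLs.')
    allDomain += allUrl

# Keep only the host part, e.g. https://example.com/a -> example.com
allDomain_1 = []
for i in fun_set(allDomain):
    data = '/'.join(i.split('/')[2:3])
    if data:
        allDomain_1.append(data)

# Print the unique domains with a running index
n = 0
for i in fun_set(allDomain_1):
    n += 1
    print(n, i)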
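
As an alternative sketch, the host extraction can be done with urllib.parse instead of splitting on slashes. The function name unique_hosts is hypothetical. Note the differences from the script above: hostname lowercases the host and drops any port, and the result keeps the order in which hosts first appear.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def unique_hosts(urls: Iterable[str]) -> List[str]:
    """Return the distinct hostnames found in urls, in first-seen order."""
    seen = set()
    hosts = []
    for url in urls:
        # hostname is None for empty lines and for strings without a scheme://host part
        host = urlsplit(url.strip()).hostname
        if host and host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts
```

For example, passing the lines read from urls.txt yields one entry per site regardless of how many result pages pointed at it.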

New discovery

After installing the TunSafe or WireGuard client on Windows, ssh could be run directly from cmd to connect to the remote Linux machine.
