A collection of useful Python projects

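An efficient crawler for WeChat official-account article history and read counts

$ git clone https://github.com/wonderfulsuccess/weixin_crawler
$ cd weixin_crawler/project/
$ python3 ./main.py    # or plain "python" if that is what Python 3 is called on your system

Competitive programming (OI Wiki)

# https://github.com/24OI/OI-wiki/
git clone https://git.dev.tencent.com/scaffrey/OI-wiki.git -b coding-pages
cd OI-wiki
# with Python 3
python3 -m http.server
# with Python 2
python2 -m SimpleHTTPServer 8888
# On some systems there is no executable named python3/python2; try plain python instead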

Hunting down usernames across the whole web (Sherlock)

# clone the repo
$ git clone https://github.com/TheYahya/sherlock.git

# change the working directory to sherlock
$ cd sherlock

# install the requirements
$ pip3 install -r requirements.txt
$ python3 sherlock.py --help
usage: sherlock.py [-h] [--version] [--verbose] [--quiet] [--tor]
                   [--unique-tor] [--csv] [--site SITE_NAME]
                   USERNAMES [USERNAMES ...]

Sherlock: Find Usernames Across Social Networks (Version 0.2.0)

QR codes

pip install qrcode
import qrcode

# Content to encode
data = "https://www.baidu.com"
# Generate the QR code
img = qrcode.make(data=data)
# Show the QR code
img.show()
# Save the QR code to a file (the decoding step below reads this file)
img.save("baidu.jpg")
pip install zxing
import zxing

# Decode the saved QR code
reader = zxing.BarCodeReader()
barcode = reader.decode("baidu.jpg")
print(barcode.parsed)

Local blockchain

# https://learnku.com/articles/23196
pip install py-geth
>>> from geth import LiveGethProcess, DevGethProcess
>>> geth = LiveGethProcess()
>>> geth.start()
>>> geth = DevGethProcess('testing', '/tmp/some-other-base-dir/')
>>> geth.start()

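Installing virtual environments

pip install virtualenv
pip install virtualenvwrapper-win  // on Windows
pip install virtualenvwrapper  // on Linux
mkvirtualenv *envName*  // create a virtual environment (stored under the user's home directory)
workon *envName*  // activate the environment
rmvirtualenv *envName* // delete the environment
deactivate  // leave the current environment
workon // list the environments
Packages are installed per environment: activate the environment first, then run pip install.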
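
The same idea can also be tried without installing anything, because Python 3 ships a venv module in the standard library. The sketch below is only an illustration: the helper name create_env and the environment name demo-env are made up for this example, and the environment is created in a temporary directory so that nothing is left behind.

```python
import tempfile
import venv
from pathlib import Path


def create_env(base_dir, name):
    """Create a virtual environment called `name` under `base_dir` (hypothetical helper)."""
    env_dir = Path(base_dir) / name
    # with_pip=False keeps the sketch fast; pass with_pip=True to get pip inside the environment
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as tmp:
    env = create_env(tmp, "demo-env")
    # Every environment created by venv contains a pyvenv.cfg file
    print((env / "pyvenv.cfg").is_file())
```

Unlike virtualenvwrapper, venv does not keep all environments in one place; each environment lives wherever you create it.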

Downloading music (NetEase Cloud Music and other sources)

$ pip3 install pymusic-dl

$ music-dl --help
Usage: music-dl [OPTIONS]

  Search and download music from netease, qq, kugou, baidu and xiami.
  Example: music-dl -k "周杰伦"

Options:
  --version            Show the version and exit.
  -k, --keyword TEXT   搜索关键字
  -s, --source TEXT    数据源目前支持qq netease kugou baidu xiami flac
  -c, --count INTEGER  搜索数量限制
  -o, --outdir TEXT    指定输出目录
  -x, --proxy TEXT     指定代理（如http://127.0.0.1:1087）
  -m, --merge          对搜索结果去重和排序（默认不去重）
  -v, --verbose        详细模式
  --help               Show this message and exit.
  
  λ music-dl -k '周杰伦'
  
  Searching '周杰伦' from ... QQ ... KUGOU ... NETEASE ... BAIDU ... XIAMI ...
  ---------------------------
  
   [  0 ]      QQ | 0:03:35 -  3.28MB - 周杰伦 - 告白气球 - 周杰伦的床边故事
   [  1 ]      QQ | 0:03:59 -  3.66MB - 周杰伦 - 青花瓷 - 我很忙
   [  2 ]      QQ | 0:04:29 -  4.12MB - 周杰伦 - 晴天 - 叶惠美
   [  3 ]      QQ | 0:05:26 -  4.99MB - 周杰伦 - 轨迹 (伴奏) - 寻找周杰伦
   [  4 ]      QQ | 0:04:16 -  3.92MB - 周杰伦 - 说好的幸福呢 - 魔杰座
   [  5 ]   KUGOU | 0:03:35 -  3.28MB - 周杰伦 - 告白气球 - 周杰伦的床边故事
   [  6 ]   KUGOU | 0:04:29 -  4.12MB - 周杰伦 - 晴天 - 叶惠美
   [  7 ]   KUGOU | 0:03:57 -  3.62MB - 周杰伦 - 青花瓷 - 我很忙
   [  8 ]   KUGOU | 0:03:43 -  3.41MB - 周杰伦 - 稻香 - 魔杰座
   [  9 ]   KUGOU | 0:04:56 -  4.53MB - 周杰伦 - 不能说的秘密 - 不能说的秘密 电影原声带
   [ 10 ] NETEASE | 0:03:12 -  2.21MB - 周杰伦、李玟 - 刀马旦 - Partners 拍档
   [ 11 ]   BAIDU | 0:04:28 - 22.38MB - 山弟 - 周杰伦 - 周杰伦
   [ 12 ]   BAIDU | 0:03:35 -  3.29MB - 周杰伦 - 告白气球 - 周杰伦的床边故事
   [ 13 ]   BAIDU | 0:03:59 -  3.65MB - 周杰伦 - 青花瓷 - 我很忙
   [ 14 ]   BAIDU | 0:04:30 -  4.14MB - 周杰伦 - 简单爱 - 婚礼歌手 幸福情歌精选
   [ 15 ]   BAIDU | 0:03:46 -  3.46MB - 周杰伦 - 夜曲 - 十一月的萧邦
   [ 16 ]   XIAMI | 0:03:40 -  3.61MB - 张学友 - 星晴 (Live) - 活出生命Live演唱会
   [ 17 ]   XIAMI | 0:03:15 -  2.98MB - Abby - 周杰伦香水广告BGM - 周杰伦香水广告
   [ 18 ]   XIAMI | 0:04:08 -   9.5MB - G.E.M.邓紫棋 - 龙卷风 - 龙卷风
   [ 19 ]   XIAMI | 0:04:34 -  4.19MB - 鲁瑾 - 周杰伦：夜曲 - 二十年的青春和往昔第五季 这一刻你依然如此动人
   [ 20 ]   XIAMI | 0:03:16 -  2.99MB - Abby - 周杰伦香水广告单曲（升key） - 周杰伦香水广告
  
  ---------------------------
  请输入下载序号，支持形如 0 3-5 8 的格式，输入 N 跳过下载

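Stock market data

pip install tushare

import pandas as pd
import tushare as ts
ts.set_token('your token')
pro = ts.pro_api()
# '' means "no filter", i.e. fetch everything; see https://www.makcyun.top/python_data_analysis01.html
data = pro.stock_basic(exchange_id='', is_hs='', fields='symbol,name,is_hs,list_date,list_status')
print(data)
# Use the listing date as a DatetimeIndex, then keep the stocks listed in 2017
data['list_date'] = pd.to_datetime(data['list_date'])
data = data.set_index('list_date', drop=False)
data = data[data.index.year == 2017]
print(data.head())
# Result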
              ts_code  symbol  name list_status  list_date is_hs
list_date                                                       
2017-12-25  001965.SZ  001965  招商公路           L 2017-12-25     S
2017-03-24  002774.SZ  002774  快意电梯           L 2017-03-24     N
2017-01-12  002824.SZ  002824  和胜股份           L 2017-01-12     N
2017-01-06  002838.SZ  002838  道恩股份           L 2017-01-06     N
2017-01-24  002839.SZ  002839  张家港行           L 2017-01-24     S

Analyzing your WeChat friends by region with Python

# -*- coding: UTF-8 -*-
# Source: https://zhuanlan.zhihu.com/p/60723803
from wxpy import *
from os import path
import re, jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud,ImageColorGenerator
import matplotlib.font_manager as fm
# Initialize a bot object
# cache_path: cache path; pass the path of the cache file created at the first login
bot = Bot()
# Get the friend list (including yourself)
my_friends = bot.friends(update=False)
'''
stats_text(): a quick text summary of basic information about your WeChat friends
    :param total: total number
    :param sex: gender distribution
    :param top_provinces: distribution by province
    :param top_cities: distribution by city
    :return: the summary text
'''
print(my_friends.stats_text())
# Clean the data and generate word clouds
# Path of the stop-word list
stopwords_path='static/stopwords.txt'

# Segment text with jieba and drop stop words
def jiebaclearText(text):
    # Collects the words that survive stop-word removal
    mywordList=[]
    # Segment the text (accurate mode)
    seg_list=jieba.cut(text,cut_all=False)
    # Join the generator's items with "/"
    listStr='/'.join(seg_list)
    # Strip leftovers of the emoji markup
    listStr = listStr.replace("class","")
    listStr = listStr.replace("span", "")
    listStr = listStr.replace("emoji", "")
    # Read the stop-word list; the with block closes the file
    with open(stopwords_path,encoding="utf8") as f_stop:
        f_stop_text=f_stop.read()
    # One stop word per line
    f_stop_seg_list=f_stop_text.split("\n")
    # Keep words that are not stop words and are longer than one character
    for myword in listStr.split('/'):
        if myword.strip() not in f_stop_seg_list and len(myword.strip())>1:
            mywordList.append(myword)
    return ' '.join(mywordList)
# Generate a word cloud
def make_wordcloud(text1,i):
	bg = plt.imread(r"image/heart.jpg")
	# Build the word cloud
	wc = WordCloud(# FFFAE3
		background_color="#FFFFFF",  # white background (the default is black)
		width=990,  # image width
		height=440,  # image height
		mask=bg,
		margin=10,  # margin around the words
		max_font_size=70,  # largest font size used
		random_state=20,  # seed for the random layout and colors, so the output is reproducible
		font_path='static/simkai.ttf'  # a font with Chinese glyphs, needed to render Chinese text
	).generate(text1)
	# Font properties for matplotlib
	my_font = fm.FontProperties(fname='static/simkai.ttf')
	# Take the colors from the background image
	bg_color = ImageColorGenerator(bg)
	# Draw
	plt.imshow(wc.recolor(color_func=bg_color))
	# Hide the axes
	plt.axis("off")
	# Save the word cloud
	wc.to_file(r"image/render_0%d.png"%i)
# WeChat nicknames
nick_name = ''
# WeChat signatures
wx_signature = ''
for friend in my_friends:
	# Nickname: NickName
	nick_name = nick_name + friend.raw['NickName']
	# Signature: Signature
	wx_signature = wx_signature + friend.raw['Signature']

nick_name = jiebaclearText(nick_name)
wx_signature = jiebaclearText(wx_signature)
make_wordcloud(nick_name,1)
make_wordcloud(wx_signature,2)
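
The stop-word filtering step in jiebaclearText can be tried on its own, without jieba or a WeChat login. The sketch below is only an illustration: remove_stop_words is a hypothetical helper, and the token list and stop words are hand-written sample data rather than real jieba output.

```python
def remove_stop_words(tokens, stop_words):
    """Keep tokens that are not stop words and are longer than one character."""
    stop_set = set(stop_words)  # a set makes each membership test cheap
    return [t for t in tokens if t.strip() not in stop_set and len(t.strip()) > 1]


# Hand-written sample data (assumption: tokens as a segmenter might produce them)
tokens = ["我", "喜欢", "的", "Python", " ", "词云"]
stop_words = ["的", "喜欢"]
print(remove_stop_words(tokens, stop_words))  # ['Python', '词云']
```

Single-character tokens are dropped even when they are not in the stop-word list, which is the same rule the function above applies.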

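Checking whether WeChat friends have deleted you

Log in to WeChat Web from Python and send the special character జ్ఞా to every friend. Because of a WeChat bug, the recipient never sees this character, but if someone has deleted you, your own chat view shows that the message failed to send because that contact has deleted you.
pip install itchat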
import itchat
import time
  
  
itchat.auto_login(hotReload=True) # 热加载
  
print('检测结果可能会引起不适。')
print('检测结果请在手机上查看，此处仅显示检测信息。')
print('消息被拒收为被拉黑， 需要发送验证信息为被删。')
print('没有结果就是好结果。')
print('检测1000位好友需要34分钟， 以此类推。')
print('为了你的账号安全着想，这个速度刚好。')
print('在程序运行期间请让程序保持运行，网络保持连接。')
print('请不要从手机端手动退出。')
input('按ENTER键继续...')
  
  
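friends = itchat.get_friends(update=True)
length = len(friends)
  
for i in range(1, length):
    # WeChat bug: a message such as "ॣ ॣ ॣ" sent from your own account is visible only to you when
    # the contact still has you as a friend; when the contact no longer has you as a friend, nobody receives it.
    # As of 2019/1/6 8:30 the bug had not been fixed.
    # friends[0] is yourself, so the loop starts at the second entry: range(1, length).
    itchat.send("జ్ఞా", toUserName=friends[i]['UserName'])
    print(f'检测到第{i}位好友: {str(friends[i]["NickName"]).center(20, " ")}')
    # Sending too fast gets flagged by WeChat as abnormal behavior. Source: https://www.qingwei.tech/programe-develops/python/1268.html
    time.sleep(2)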
  
print('已检测完毕，请在手机端查看结果。')
  
  
itchat.run()

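Analyzing your WeChat friends' signatures

# -*- coding: utf-8 -*-
# Source: https://juejin.im/post/5ca4c7d7f265da30d34ba228
import itchat
import re
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
# scipy.misc.imread was removed in SciPy 1.2; matplotlib's imread reads the mask image instead
from matplotlib.pyplot import imread

# Import wxpy
from wxpy import *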

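'''
A WeChat bot can log in in three modes:
(1) minimal mode: robot = Bot()
(2) terminal mode: robot = Bot(console_qr=True)
(3) cached mode (keeps the login state): robot = Bot(cache_path=True)
'''
# Initialize the bot and log in (by scanning the QR code) in cached mode
robot = Bot(cache_path=True)

# Get friends, groups and official accounts
robot.chats()

# Print statistics about your friends
Friends = robot.friends()
print(Friends.stats_text())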


sign_list=[]
itchat.auto_login(hotReload=True)
itchat.send(u'Hello,world','filehelper')
friends = itchat.get_friends(update=True)[0:]
print('开始获取微信好友个性签名.....')
for i in friends:
    signature = i["Signature"].strip().replace("span", "").replace("class", "").replace("emoji", "") # strip emoji markup
    #rep = re.compile("< =.+/>")
    # Remove everything except Chinese characters (and a literal ^)
    rep = re.compile("[^\u4e00-\u9fa5^]")
    nickName=i["NickName"]
    signature = rep.sub("", signature)
    sign_list.append(signature)

text=''.join(sign_list)
wordlist_jieba = jieba.cut(text, cut_all=True)
wl_space_split = ' '.join(wordlist_jieba)

back_color = imread('./mao.jpg')

# Word cloud
my_wordcloud = WordCloud(
    background_color='white',  # background color
    max_words=2000,  # maximum number of words
    mask=back_color,  # draw the cloud in the shape of this image; when set, width and height are ignored
    max_font_size=100,  # largest font size used
    font_path='C:/Windows/Fonts/simfang.ttf',  # font file; fixes Chinese characters showing up as empty boxes
    random_state=42,  # seed for the random layout and colors, so the output is reproducible
    )

# Generate the word cloud from wl_space_split
my_wordcloud.generate(wl_space_split)

# Derive the colors from the colored mask image
image_colors = ImageColorGenerator(back_color)
# Show the image
# plt.imshow(my_wordcloud)
# Hide the axes
# plt.axis('off')
# Draw the word cloud
plt.figure()
plt.imshow(my_wordcloud.recolor(color_func=image_colors))
plt.axis('off')
# Save the image
my_wordcloud.to_file('ciyun.png')

Handling time zones in Python

pip install pytz
>>> from datetime import datetime, timedelta
>>> from pytz import timezone
>>> import pytz
>>> utc = pytz.utc
>>> utc.zone
'UTC'
>>> beijing = timezone('Asia/Shanghai')
>>> beijing.zone
'Asia/Shanghai'
>>> tokyo = timezone('Asia/Tokyo')
>>> tokyo.zone
'Asia/Tokyo'
The first way is the localize() method provided by pytz. It attaches a time zone to a naive datetime, that is, one without time zone information:

>>> fmt = '%Y-%m-%d %H:%M:%S %Z%z'
>>> loc_dt = beijing.localize(datetime(2018, 10, 27, 6, 0, 0))
>>> print(loc_dt.strftime(fmt))
2018-10-27 06:00:00 CST+0800

The second way is the standard astimezone() method, which converts an already localized time to another zone:

>>> jp_dt = loc_dt.astimezone(tokyo)
>>> jp_dt.strftime(fmt)
'2018-10-27 07:00:00 JST+0900'
The preferred way to handle times is to always work in UTC and convert to local time only when producing output for people to read:
>>> utc_dt = datetime(2018, 10, 27, 6, 0, 0, tzinfo=utc)
>>> loc_dt = utc_dt.astimezone(beijing)
>>> loc_dt.strftime(fmt)
'2018-10-27 14:00:00 CST+0800'
The library also supports date arithmetic on local times, for example computing the time difference between Beijing and Tokyo:
>>> timestamp = datetime.utcnow()
>>> dt_cn = beijing.localize(timestamp)
>>> dt_jp = tokyo.localize(timestamp)
>>> x = dt_cn - dt_jp
>>> int(x.total_seconds()/3600)
1
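
The "work in UTC, convert only for display" rule can also be sketched with the standard library alone. This is only an illustration under a stated assumption: fixed UTC offsets stand in for real time zones, which is adequate for Beijing and Tokyo on this date but ignores daylight-saving rules, so it is not a replacement for pytz. The names BEIJING, TOKYO and to_local are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets instead of full time zone rules (no daylight-saving handling)
BEIJING = timezone(timedelta(hours=8), "CST")
TOKYO = timezone(timedelta(hours=9), "JST")
FMT = "%Y-%m-%d %H:%M:%S %Z%z"


def to_local(utc_dt, tz):
    """Format an aware UTC datetime as local time in the given zone."""
    return utc_dt.astimezone(tz).strftime(FMT)


# Store and compute in UTC ...
utc_dt = datetime(2018, 10, 27, 6, 0, 0, tzinfo=timezone.utc)
# ... and convert only when showing the time to a person
print(to_local(utc_dt, BEIJING))  # 2018-10-27 14:00:00 CST+0800
print(to_local(utc_dt, TOKYO))    # 2018-10-27 15:00:00 JST+0900
```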

Plotting where your WeChat friends live with Python

# Source: http://shuaiguoer.com/%E5%BE%AE%E4%BF%A1%E5%A5%BD%E5%8F%8B%E4%BD%8D%E7%BD%AE%E5%88%86%E5%B8%83%E5%9B%BE/
from collections import Counter

import itchat
# pyecharts 0.5.x API (pyecharts 1.0 moved Map to pyecharts.charts)
from pyecharts import Map

# Log in to WeChat
itchat.login()
# Get information about all friends
owner = itchat.get_friends()
# Count the friends in each province
a = Counter(i['Province'] for i in owner)
# The distinct provinces
name = list(a.keys())
# The number of friends in each province, in the same order
value = list(a.values())
# Create the map
maps = Map('微信好友位置分布图', width=1500, height=900)     # set the width and height of the map
# Add the data to the map
maps.add('', name, value, maptype='china', is_visualmap=True, visual_text_color='#000', is_label_show=True, visual_range=[0, 20])
# is_visualmap        --->    whether to use the visual-map component
# visual_text_color   --->    color of the text at both ends of the scale
# is_label_show       --->    whether to show labels, i.e. the data shown for each point
# visual_range        --->    the allowed minimum and maximum values
maps.render('微信好友位置分布图.html')       # write the HTML file
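
The counting step that feeds the map can be tried without logging in to WeChat. The sketch below is only an illustration: count_by_province is a hypothetical helper and the friend records are hand-written samples that mimic the Province field returned by itchat.get_friends().

```python
from collections import Counter


def count_by_province(friends):
    """Count friends per province, skipping friends whose province is empty."""
    return Counter(f["Province"] for f in friends if f.get("Province"))


# Hand-written sample records (assumption: same field name as itchat's friend entries)
sample = [
    {"Province": "北京"},
    {"Province": "广东"},
    {"Province": "北京"},
    {"Province": ""},
]
counts = count_by_province(sample)
names, values = list(counts.keys()), list(counts.values())
print(names, values)  # ['北京', '广东'] [2, 1]
```

Skipping empty provinces keeps friends who did not fill in a location from showing up as an unnamed region.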

Letting the Tuling robot chat with a chosen WeChat friend

import itchat
import http.client
import json
 
# WeChat id of the friend to listen to
touserNameId = '@becc4377a98a6df74fafc1192a3dd045'
fromuserId='d3aef9a75e2b51430683ffeb386f0564'
tulingDomain='openapi.tuling123.com'
tulingOpenapiUrl='http://'+ tulingDomain + '/openapi/api/v2'
# Number of messages handled so far
count= 0
 
# Handle incoming text messages
@itchat.msg_register(itchat.content.TEXT)
def reply_msg(msg):
 
    print(msg)
    # Reply only to the chosen friend
    if msg['FromUserName'] == touserNameId and msg['ToUserName'] == touserNameId:
 
        global count
 
        # API keys of the Tuling robots; the next key takes over after every 98 messages
        robbots = ['appkey0', 'appkey1', 'appkey2', 'appkey3', 'appkey4']
 
        count += 1
        # Clamp the index so that the last key stays in use once every key has had its turn
        usedRobbot = robbots[min(count // 98, len(robbots) - 1)]
        print("Received:", msg.text)
        info = msg.text
 
 
        headers = {
            # heard部分直接通过chrome部分request header部分
            'Accept': 'application/json, text/plain, */*',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Connection': 'keep-alive',
            'Content-Length': '14',  # get方式提交的数据长度，如果是post方式，转成get方式：【id=wdb&pwd=wdb】
            'Content-Type': 'application/x-www-form-urlencoded',
            'Referer': 'http://10.1.2.151/',
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36'
        }
        data = {
	        "reqType":0,
             "perception": {
                "inputText": {
                    "text": info
                },
                "selfInfo": {
                    "location": {
                        "city": "北京",
                        "province": "北京",
                        "street": "信息路"
                    }
                }
             },
            "userInfo": {
                "apiKey": usedRobbot,
                "userId": "me"
            }
        }
        # 这里是为了找到 id 包含 FromUserName 和 ToUserName
        print(data)
        conn = http.client.HTTPConnection(tulingDomain)
        header = {"Content-type": "application/json"}
        conn.request(method="POST", url=tulingOpenapiUrl, headers=header, body=json.dumps(data))
        response = conn.getresponse()
        # print(response.status)
        # print(response.reason)
        res = response.read()
        # print(res)
        resp = json.loads(res)
        #print(resp)
        #print(type(resp))
 
        reponseType = resp['results'][0]['values']
        print(reponseType)
        print(type(reponseType))
 
        #str = input("回复：")
        itchat.send(reponseType['text'],toUserName=touserNameId)
 
 
if __name__ == '__main__':
    # Keep the login state after the program exits; source: https://blog.csdn.net/yan88888888888888888/article/details/89373626
    itchat.auto_login(hotReload=True)
 
    # Send a message to File Transfer (filehelper)
    itchat.send("文件助手你好哦", toUserName="filehelper")
    itchat.run()
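
The key-rotation idea in reply_msg (use one API key for a batch of messages, then move on to the next) can be isolated in a small function. This is only a sketch: pick_key is a hypothetical helper, the batch size of 98 is taken from the script above, and staying on the last key once all keys are used up is an assumption about the desired behavior.

```python
def pick_key(keys, message_count, per_key=98):
    """Return the API key for the message_count-th message (counting from 1)."""
    index = (message_count - 1) // per_key
    # Assumption: once every key has been used, keep using the last one
    return keys[min(index, len(keys) - 1)]


keys = ["appkey0", "appkey1", "appkey2"]
print(pick_key(keys, 1))     # appkey0
print(pick_key(keys, 98))    # appkey0
print(pick_key(keys, 99))    # appkey1
print(pick_key(keys, 5000))  # appkey2
```

Here each key serves exactly 98 messages; the script's count // 98 switches one message earlier for the first key.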

Replying to WeChat messages with the Tuling robot

from wxpy import *

import wx_friend

# WeChat bot; the login state is cached
bot = Bot(cache_path=True)


@bot.register(msg_types=FRIENDS)
def auto_accept(msg):
    """Automatically accept friend requests"""
    wx_friend.auto_accept_friends(msg)


@bot.register(chats=Friend)
def auto_reply(msg):
    """Automatically reply to friends"""
    if msg.type == TEXT:
        wx_friend.auto_reply(msg)
    elif msg.type == RECORDING:
        return '不听不听，王八念经'
    else:
        pass


embed()

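# Presumably tuling_robot.py: wx_friend calls it as tuling_robot.auto_reply
import json

import requests
from wxpy import Tuling

"""
    Apply for a free Tuling robot account to get an api_key
    Sign up at http://www.tuling123.com
"""
tuling = Tuling(api_key='7c8cdb56b0dc4450a8deef30a496bd4c')


def auto_reply(msg):
    """Reply to the message and return the reply text"""
    return tuling.do_reply(msg)


if __name__ == '__main__':
    """
        Run this file directly to test the Tuling robot
        This apikey ships with wxpy and has a usage quota; applying for your own free key is recommended
        Sign up at http://www.tuling123.com
    """
    apikey = '7c8cdb56b0dc4450a8deef30a496bd4c'
    api_url = 'http://www.tuling123.com/openapi/api'
    data = {'key': apikey, 'info': '你好'}
    req = requests.post(api_url, data=data).text
    replys = json.loads(req)['text']
    print(replys)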
    
    bot = Bot()
    
    
# wx_friend.py; source: https://github.com/pig6/wxrobot/blob/master/wx_friend.py
import tuling_robot


def auto_accept_friends(msg):
    """Automatically accept friend requests"""
    # Accept the friend request
    new_friend = msg.card.accept()
    # Send a message to the new friend
    new_friend.send('我已自动接受了你的好友请求')


def auto_reply(msg):
    """Automatic reply"""
    # Keyword reply first; fall back to the Tuling robot
    keyword_reply(msg) or tuling_reply(msg)


def keyword_reply(msg):
    """Keyword reply"""
    if '你叫啥' in msg.text or '你叫啥名字' in msg.text:
        return msg.reply('沃德天·维森莫·拉莫帅·帅德布耀')


def tuling_reply(msg):
    """Reply through the Tuling robot"""
    tuling_robot.auto_reply(msg)

Keeping recalled WeChat messages (anti-recall)

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

__author__ = 'jiangwenwen'

import itchat
from itchat.content import *
import time
import re
import os

print("该程序由里客云资源站开发，网址：likeyunba.com https://segmentfault.com/a/1190000018691537")
print("作者:TANKING")
print("打开程序会弹出一个二维码，微信扫码")
print("如果二维码弹不出，那就在你这个程序的同一个目录下找到QR.png双击打开扫码")
print("扫码后，出现Start auto replying就可以实时监控消息了...")

msg_information = {}
# 针对表情包的内容
face_bug = None

@itchat.msg_register([TEXT, PICTURE, FRIENDS, CARD, MAP, SHARING, RECORDING, ATTACHMENT, VIDEO], isFriendChat=True, isMpChat=True)
def handle_receive_msg(msg):
    global face_bug
    # 接收消息的时间
    msg_time_rec = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    # 在好友列表列表中查询发送信息的好友昵称
    msg_from = itchat.search_friends(userName=msg['FromUserName'])['NickName']
    # 信息发送的时间
    msg_time = msg['CreateTime']
    # 每条信息的ID
    msg_id = msg['MsgId']
    # 储存信息的内容
    msg_content = None
    # 储存分享的连接，比如分享的文章和音乐
    msg_share_url = None

    # 如果发送的消息是文本或者好友推荐
    if msg['Type'] == 'Text' or msg['Type'] == 'Friends':
        msg_content = msg['Text']
        print(msg_content)

    # 如果发送的消息是附件，视频，图片，语音
    elif msg['Type'] == 'Attachment' or msg['Type'] == 'Video' \
        or msg['Type'] == 'Picture'\
            or msg['Type'] == 'Recording':
        # 内容为下载文件名
        msg_content = msg['FileName']
        msg['Text'](str(msg_content))

    # 如果消息是推荐的名片
    elif msg['Type'] == 'Card':
        # 内容是推荐人的昵称和性别
        msg_content = msg['RecommendInfo']['NickName'] + '的名片'
        if msg['RecommendInfo']['Sex'] == 1:
            msg_content += '性别为男'
        else:
            msg_content += '性别为女'

        print(msg_content)

    # 如果消息为分享的位置信息
    elif msg['Type'] == 'Map':
        x, y, location = re.search(
            "<location x=\"(.*?)\" y=\"(.*?)\".*label=\"(.*?)\".*", msg['OriContent']).group(1, 2, 3)
        if location is None:
            # 内容为详细地址
            msg_content = r'纬度->' + x.__str__() + "经度->" + y.__str__()
        else:
            msg_content = r"" + location

    # 如果消息是分享的音乐或者文章，详细的内容为文章的标题或者分享的名字
    elif msg['Type'] == 'Sharing':
        msg_content = msg['Text']
        msg_share_url = msg['Url']
        print(msg_share_url)
    face_bug = msg_content

    # 将信息存储在字典中，每一个msg_id对应一条消息
    msg_information.update(
        {
            msg_id: {
                "msg_from": msg_from, "msg_time": msg_time, "msg_time_rec": msg_time_rec,
                "msg_type": msg['Type'],
                "msg_content": msg_content, "msg_share_url": msg_share_url
            }
        }
)

# Listen for recall notices
@itchat.msg_register(NOTE, isFriendChat=True, isGroupChat=True, isMpChat=True)
def information(msg):
    # A recall notice contains the text below together with the id of the recalled message
    if '撤回了一条消息' in msg['Content']:
        old_msg_id = re.search(r"<msgid>(.*?)</msgid>", msg['Content']).group(1)
        # Look up the stored message; it is missing if it arrived before the script started
        old_msg = msg_information.get(old_msg_id)
        print(old_msg)

        # A short id means a sticker: forward the most recent content and stop here
        if len(old_msg_id)<11:
            itchat.send_file(face_bug, toUserName='filehelper')
            return
        if old_msg is None:
            return

        # Build the notice that is sent to File Transfer
        msg_body = "【"\
                   + old_msg.get('msg_from') + "撤回了】\n"\
                   + old_msg.get("msg_type") + "消息:" + "\n"\
                   + old_msg.get("msg_time_rec") + "\n"\
                   + r"" + old_msg.get("msg_content")

        # If a shared link was recalled, append its URL to msg_body
        if old_msg['msg_type'] == "Sharing":
            msg_body += "\n就是这个链接>" + old_msg.get('msg_share_url')

        # Send the recalled message to File Transfer
        itchat.send_msg(msg_body, toUserName="filehelper")

        # If the message carried a file, send the file back as well
        if old_msg["msg_type"] == "Picture"\
                or old_msg["msg_type"] == "Recording"\
                or old_msg["msg_type"] == "Video"\
                or old_msg["msg_type"] == "Attachment":
            file = "@fil@%s" % (old_msg['msg_content'])
            itchat.send(msg=file, toUserName='filehelper')
            os.remove(old_msg['msg_content'])

        # Drop the stored entry
        msg_information.pop(old_msg_id, None)

itchat.auto_login(hotReload=True)
itchat.run()
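
The script keeps every received message in msg_information and removes an entry only when that message is recalled, so the dictionary grows for as long as the program runs. One way to bound it is to drop entries that are too old to be recalled. The sketch below is only an illustration: prune_old_messages and the received_at field are hypothetical (the script stores a formatted time string instead), and the two-minute window is an assumption about how long a recall remains possible.

```python
import time

RECALL_WINDOW_SECONDS = 120  # assumption: a message can be recalled for about two minutes


def prune_old_messages(store, now=None, max_age=RECALL_WINDOW_SECONDS):
    """Delete entries whose 'received_at' timestamp is older than max_age seconds."""
    now = time.time() if now is None else now
    expired = [msg_id for msg_id, info in store.items()
               if now - info["received_at"] > max_age]
    for msg_id in expired:
        del store[msg_id]
    return len(expired)


# Hand-written sample store with a hypothetical 'received_at' field (seconds since the epoch)
store = {
    "a": {"received_at": 1000.0, "msg_content": "old"},
    "b": {"received_at": 1100.0, "msg_content": "recent"},
}
removed = prune_old_messages(store, now=1150.0)
print(removed, sorted(store))  # 1 ['b']
```

Calling such a function at the start of handle_receive_msg would keep the store roughly as large as the traffic of the last two minutes.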

Automatically adding friends to a group chat

#coding=utf-8
from wxpy import *
base_bot = Bot(True)
# Find the target group; make sure the logged-in account is in a group named "123"
group = base_bot.groups().search('123')
# Automatically accept new friend requests
@base_bot.register(msg_types=FRIENDS)
def auto_accept_friends(msg):
    if '加好友' in msg.text.lower():
        # Accept the friend request
        new_friend = base_bot.accept_friend(msg.card)
        # new_friend = msg.card.accept()
        # Send a message to the new friend
        new_friend.send('你好，我是群聊机器人，回复"入群"口令进入群聊天哦！')
# Handle text messages
@base_bot.register(msg_types=TEXT)
def add_into_chatroom(msg):
    # The keyword "入群" ("join the group") triggers the invitation
    if msg.text.lower() == '入群':
        # use_invitation=True sends a group invitation; False adds the user to the group directly
        group[0].add_members(msg.sender, use_invitation=True)
    else:
        # Any other message
        return u'收到：' + msg.text
base_bot.join()

from wxpy import *

bot = Bot(cache_path=True)
friends_stat = bot.friends().stats()

for sex, count in friends_stat["sex"].items():
    # 1 means MALE, 2 means FEMALE
    if sex == 1:
        print("MALE %d" % count)
    elif sex == 2:
        print("FEMALE %d" % count)

friend_loc = [] # each element is a two-item list: [region, number of friends]
for province, count in friends_stat["province"].items():
    if province != "":
        friend_loc.append([province, count])

# Sort by number of friends, descending
friend_loc.sort(key=lambda x: x[1], reverse=True)

# Print the 10 regions with the most friends
for item in friend_loc[:10]:
    print(item[0], item[1])
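
The filter, sort and slice steps above can also be written with collections.Counter. The sketch below is only an illustration: top_regions is a hypothetical helper and the numbers are hand-written sample data, not real wxpy output.

```python
from collections import Counter


def top_regions(province_counts, n=10):
    """Return the n most common non-empty provinces as (province, count) pairs."""
    cleaned = Counter({p: c for p, c in province_counts.items() if p})
    return cleaned.most_common(n)


# Hand-written sample data; the empty key stands for friends without a province
stats = {"北京": 12, "": 30, "广东": 25, "上海": 8}
print(top_regions(stats, 2))  # [('广东', 25), ('北京', 12)]
```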

Sending a message to all friends

import time
from wxpy import *
# Sources: https://yfzhou.coding.me/2018/09/04/%E5%BE%AE%E4%BF%A1%E6%9C%80%E5%BC%BA%E8%8A%B1%E5%BC%8F%E6%93%8D%E4%BD%9C%EF%BC%8C%E5%B8%A6%E4%BD%A0%E7%8E%A9%E8%BD%AC-wxpy/
# https://gitee.com/ShaErHu/wxpy_matplotlib_learning
# Initialize a bot object
# cache_path: where the login state is cached; pass the path of the cache file created at the first login
bot = Bot(cache_path=r"D:\PycharmProjects\pythonProcedure\com\zyf\weixin\wxpy.pkl")

# Mass messaging (use with care)
my_friends = bot.friends(update=False)
my_friends.pop(0)   # drop the first element (yourself)
for friend in my_friends[:120]: # rate limit: at most 120 sends within 2 minutes (see the exception handling section of the wxpy docs)
    friend.send('Good morning,the early bird catches the worm!(早上好，早起的鸟儿有虫吃！)')
    time.sleep(2)
    friend.send('不用回复，生活中一起加油！')
    
   
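# Get all friends [returns a Chats list with all your friends, including yourself]
t0 = bot.friends(update=False)
# Number of friends (not counting yourself)
print("我的好友数："+str(len(t0)-1))

# Get all WeChat groups [returns a Groups list]
t1 = bot.groups(update=False)
# Number of (active) groups
print("我的微信群聊数："+str(len(t1)))

# Get all official accounts you follow [returns a Chats list]
t2 = bot.mps(update=False)
# Number of official accounts you follow
print("我关注的微信公众号数："+str(len(t2)))
# Initialize a bot object
# cache_path: cache path; pass the path of the cache file created at the first login
bot = Bot(cache_path=r"D:\PycharmProjects\pythonProcedure\com\zyf\weixin\wxpy.pkl")
# Get the friend list (including yourself)
my_friends = bot.friends(update=False)
'''
stats_text(): a quick text summary of basic information about your WeChat friends
    :param total: total number
    :param sex: gender distribution
    :param top_provinces: distribution by province
    :param top_cities: distribution by city
    :return: the summary text
'''
print(my_friends.stats_text())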
# WeChat nicknames
nick_name = ''
# WeChat signatures
wx_signature = ''
for friend in my_friends:
    # Nickname: NickName
    nick_name = nick_name + friend.raw['NickName']
    # Signature: Signature
    wx_signature = wx_signature + friend.raw['Signature']

nick_name = jiebaclearText(nick_name)
wx_signature = jiebaclearText(wx_signature)
make_wordcloud(nick_name,1)
make_wordcloud(wx_signature,2)
# Names of the official accounts
wx_public_name = ''
# Descriptions of the official accounts
wx_pn_signature = ''
# Get the list of official accounts
my_wx_pn = bot.mps(update=False)
for wx_pn in my_wx_pn:
    wx_public_name = wx_public_name + wx_pn.raw['NickName']
    wx_pn_signature = wx_pn_signature + wx_pn.raw['Signature']

wx_public_name = jiebaclearText(wx_public_name)
make_wordcloud(wx_public_name,3)
wx_pn_signature = jiebaclearText(wx_pn_signature)
make_wordcloud(wx_pn_signature,4)

Face recognition (face and eye detection with OpenCV)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2018/11/5 14:21
# @Author  : yfzhou
# @Site    : 
# @File    : face_id.py
# @Software: PyCharm
# Life is short, I use python.
# Face recognition; source: https://yfzhou.coding.me/2018/11/12/OpenCV-Python-%E5%AE%9E%E7%8E%B0%E7%AE%80%E5%8D%95%E4%BA%BA%E8%84%B8%E8%AF%86%E5%88%AB/

import cv2

filename = "C:\\Users\\Administrator\\Pictures\\2018-05-16\\1865.JPG"

def detect(filename):
    # The Haar cascade XML files ship with the opencv-python package;
    # cv2.data.haarcascades is the directory that contains them
    # haarcascade_frontalface_default detects faces
    # haarcascade_eye detects eyes
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
    img = cv2.imread(filename)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # The arguments are scaleFactor and minNeighbors: how much the image is shrunk at each step of the
    # detection, and the minimum number of neighboring rectangles a candidate needs in order to be kept
    # The result is an array of face rectangles
    faces = face_cascade.detectMultiScale(gray, 1.1, 5)
    print(faces)
    for (x, y, w, h) in faces:
        # Draw a box around each detected face
        img = cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 3)
        # Crop the face region (rows y..y+h, columns x..x+w) and look for eyes inside it
        face_re = img[y:y + h, x:x + w]
        face_re_g = gray[y:y + h, x:x + w]
        eyes = eye_cascade.detectMultiScale(face_re_g)
        for (ex, ey, ew, eh) in eyes:
            cv2.rectangle(face_re, (ex, ey), (ex + ew, ey + eh), (0, 255, 0), 2)

    # cv2.namedWindow("Human Face Result!")
    # cv2.imshow("Human Face Result!", img)
    # Save the annotated image to the given path
    cv2.imwrite("C:\\Users\\Administrator\\Pictures\\2018-05-16\\Face.jpg", img)
    # cv2.waitKey(0)
    # cv2.destroyAllWindows()

detect(filename)

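Image watermarks


from PIL import Image,ImageDraw,ImageFont
image = Image.open('wechat.png').convert('RGBA')
# Open the image to be watermarked (as RGBA, so that it matches the watermark layer)
watermark = Image.open('mp.png')
# Open the watermark image
factor = 1
# Scale factor for the watermark: 1 keeps the original size; use e.g. 0.5 to shrink it to 50%
watermark = watermark.resize(
    tuple(map(lambda x: int(x * factor), watermark.size)))
# Resize the watermark
layer=Image.new('RGBA',image.size)
# Create a new, fully transparent layer
layer.paste(watermark,(image.size[0]-watermark.size[0],
    image.size[1]-watermark.size[1]))
# Paste the watermark onto the layer; the second argument is the position, here the bottom-right corner
marked_img=Image.composite(layer,image,layer)
# Blend the layer into the image
# marked_img.show()  # open the result (a temporary file)
marked_img.convert('RGB').save('wechat_remark.jpg')
# Save the result (JPEG has no alpha channel, hence the conversion to RGB)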

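# Text watermark; source: https://www.qingwei.tech/programe-develops/python/1154.html
image = Image.open('lifeistoshort.jpg')
# Open the image to be watermarked
text=input('Enter the watermark text:\n')
# Ask for the watermark text
font=ImageFont.truetype(r'C:\Windows\Fonts\simhei.ttf',64)
# Load a font (you can also download one yourself); the second argument is the font size
layer=image.convert('RGBA')
# Convert the image to RGBA
text_overlay=Image.new('RGBA',layer.size)
# Create a new image of the same size; arguments: [mode, size, color (0 by default)]
image_draw=ImageDraw.Draw(text_overlay)
# Drawing context
left,top,right,bottom=image_draw.textbbox((0,0),text,font=font)
# Bounding box of the rendered text (textsize() was removed in Pillow 10; textbbox() replaces it)
text_xy=(layer.size[0]-right,layer.size[1]-bottom)
# Text position, here the bottom-right corner
image_draw.text(text_xy, text, font=font, fill=(0, 0, 0, 85))
# Draw the text: position, text, font, color and opacity
marked_img=Image.alpha_composite(layer,text_overlay)
# Composite the watermark onto the original image
marked_img.save('qingwei_after.png')
# Save the image
marked_img.show()
# Show the image (this opens a temporary file; the script only finishes once the viewer is closed)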

A Python trick: controlling your computer remotely through WeChat

pip install opencv-python
pip install matplotlib
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2018/8/20 11:12
# @Author  : yfzhou
# @Site    : https://yfzhou.coding.me/2018/08/20/Python%E9%AA%9A%E6%93%8D%E4%BD%9C%EF%BC%9A%E5%BE%AE%E4%BF%A1%E8%BF%9C%E7%A8%8B%E6%8E%A7%E5%88%B6%E7%94%B5%E8%84%91/
# @File    : wechat_control_computer.py
# @Software: PyCharm
# Life is short, I use python.

import itchat
import os
import time
import cv2

sendMsg = u"{消息助手}：暂时无法回复"
usageMsg = u"使用方法：\n1.运行CMD命令：cmd xxx (xxx为命令)\n" \
           u"-例如关机命令:\ncmd shutdown -s -t 0 \n" \
           u"2.获取当前电脑用户：cap\n3.启用消息助手(默认关闭)：ast\n" \
           u"4.关闭消息助手：astc"
flag = 0  # whether the message assistant is switched on
nowTime = time.localtime()
filename = str(nowTime.tm_mday) + str(nowTime.tm_hour) + str(nowTime.tm_min) + str(nowTime.tm_sec) + ".txt"
myfile = open(filename, 'w')


@itchat.msg_register('Text')
def text_reply(msg):
    global flag
    message = msg['Text']
    fromName = msg['FromUserName']
    toName = msg['ToUserName']

    if toName == "filehelper":
        if message == "cap":
            cap = cv2.VideoCapture(0)
            ret, img = cap.read()
            cv2.imwrite("weixinTemp.jpg", img)
            itchat.send('@img@%s' % u'weixinTemp.jpg', 'filehelper')
            cap.release()
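        if message[0:3] == "cmd":
            # Everything after "cmd " is the command; str.strip("cmd ") would also eat
            # matching characters from both ends of the command itself
            os.system(message[4:].strip())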
        if message == "ast":
            flag = 1
            itchat.send("消息助手已开启", "filehelper")
        if message == "astc":
            flag = 0
            itchat.send("消息助手已关闭", "filehelper")
    elif flag == 1:
        itchat.send(sendMsg, fromName)
        myfile.write(message)
        myfile.write("\n")
        myfile.flush()


if __name__ == '__main__':
    itchat.auto_login()
    itchat.send(usageMsg, "filehelper")
    itchat.run()
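
Cutting the "cmd " prefix off an incoming message is easy to get wrong, because str.strip() removes a set of characters from both ends rather than a prefix. The sketch below only illustrates the difference; extract_command is a hypothetical helper and is not part of the script above.

```python
def extract_command(message, prefix="cmd"):
    """Return the text after 'cmd', or None if the message is not a command."""
    if message == prefix or message.startswith(prefix + " "):
        return message[len(prefix):].strip()
    return None


print(extract_command("cmd dir"))              # dir
print(extract_command("cmd shutdown -s -t 0")) # shutdown -s -t 0
print(extract_command("cmdline"))              # None
# strip() treats "cmd " as a set of characters and removes them from both ends:
print("cmd dir".strip("cmd "))                 # ir
```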

Generating a Chinese word cloud

pip install wordcloud
pip install jieba
# -*- coding: utf-8 -*-
# @Time    : 2018/9/4 13:52
# @Author  : yfzhou
# @Site    : 
# @File    : demo10.py
# @Software: PyCharm
# Life is short, I use python.

# Word-cloud generator
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
from os import path
import jieba

# Directory that contains this script
d = path.dirname(path.abspath(__file__))
# Read a text file; source: https://yfzhou.coding.me/2018/09/04/Python-wordcloud-jieba-%E7%94%9F%E6%88%90%E4%B8%AD%E6%96%87%E8%AF%8D%E4%BA%91%E5%9B%BE/
text = open(r'C:\Users\Administrator\Desktop\阿里传：这是阿里巴巴的世界美特里斯曼.txt', 'r', encoding='utf-8').read()
# Read the background image
bg_pic = plt.imread(r'C:\Users\Administrator\Pictures\Other\155061877268618276.jpg')
wordlist_after_jieba = jieba.cut(text, cut_all=True)
wl_space_split = " ".join(wordlist_after_jieba)
# Build the word cloud
font = path.join(d, 'static', 'simkai.ttf')
wc = WordCloud(
    mask=bg_pic,
    background_color='white',
    font_path=font,
    scale=1.5,
    max_words=1500
).generate(wl_space_split)
# Take the colors from the background image
bg_color = ImageColorGenerator(bg_pic)
# Draw
plt.imshow(wc.recolor(color_func=bg_color))
plt.axis('off')
plt.show()
# Save the image
wc.to_file(path.join(d, 'image', 'render_09.png'))

Reading a web page aloud

# Source: https://yfzhou.coding.me/2018/09/05/%E6%89%8B%E6%8A%8A%E6%89%8B%E6%95%99%E4%BD%A0%E7%94%A8-Python-%E6%9D%A5%E6%9C%97%E8%AF%BB%E7%BD%91%E9%A1%B5/
pip install readability-lxml
pip install goose3
import requests
from readability import Document
response = requests.get('https://hoxis.github.io/run-ansible-without-specifying-the-inventory-but-the-host-directly.html')
doc = Document(response.text)
print(doc.title())
>>> from goose3 import Goose
>>> from goose3.text import StopWordsChinese
>>> url  = 'http://news.china.com/socialgd/10000169/20180616/32537640_all.html'
>>> g = Goose({'stopwords_class': StopWordsChinese})
>>> article = g.extract(url=url)
>>> print(article.cleaned_text[:150])
北京时间6月15日23:00(圣彼得堡当地时间18:00)，2018年世界杯B组一场比赛在圣彼得堡球场展开角逐，伊朗1比0险胜摩洛哥，伊朗前锋阿兹蒙半场结束前错过单刀机会，鲍哈杜兹第95分钟自摆乌
龙。这是伊朗20年来首度在世界杯决赛圈取胜。

本届世界杯，既相继出现替补便进球，贴补梅开二度以及东道主

pip install playsound
>>> from playsound import playsound
>>> playsound('/path/to/a/sound/file/you/want/to/play.mp3')
# https://github.com/hoxis/to_voice/blob/master/page2voice.py

Periodically detecting unresponsive processes and restarting them

# Source: http://www.xetlab.com/2019/04/21/python%E7%BB%83%E6%89%8B%E8%84%9A%E6%9C%AC-%E5%AE%9A%E6%97%B6%E6%A3%80%E6%B5%8B%E6%97%A0%E5%93%8D%E5%BA%94%E8%BF%9B%E7%A8%8B%E5%B9%B6%E9%87%8D%E5%90%AF/
import os
import time

import schedule


def parse_output(output):
    print(output)
    pid_list = []
    lines = output.strip().split("\n")
    if len(lines) > 2:
        for line in lines[2:]:
            pid_list.append(line.split()[1])
    return pid_list


def list_not_response(process_name):
    return list_process(process_name, True)


def list_process(process_name, not_respond=False):
    cmd = 'tasklist /FI "IMAGENAME eq %s"'
    if not_respond:
        cmd = cmd + ' /FI "STATUS eq Not Responding"'
    output = os.popen(cmd % process_name)
    return parse_output(output.read())


def start_program(program):
    os.popen(program)


def check_job():
    process_name = "xx.exe"
    not_respond_list = list_not_response(process_name)
    if len(not_respond_list) <= 0:
        return
    pid_params = " ".join(["/PID " + pid for pid in not_respond_list])
    # read() blocks until taskkill has finished, so the check below sees the result
    os.popen("taskkill /F " + pid_params).read()
    # Restart only if no instance of the program is left
    if len(list_process(process_name)) <= 0:
        start_program(r'E:\xxx\xx.exe')


if __name__ == '__main__':
    schedule.every(5).seconds.do(check_job)
    while True:
        schedule.run_pending()
        time.sleep(1)
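
parse_output splits each row on whitespace, which would misread an image name that contains spaces. One possible alternative, sketched here under the assumption that tasklist is called with /FO CSV /NH (CSV output, no header row), is to parse the rows with the csv module. The function name parse_tasklist_csv and the sample output are made up for illustration.

```python
import csv
import io


def parse_tasklist_csv(output):
    """Return the PIDs found in `tasklist /FO CSV /NH` output."""
    pids = []
    for row in csv.reader(io.StringIO(output)):
        # Rows look like: "name.exe","1234","Console","1","12,345 K"
        # Informational lines (e.g. "INFO: No tasks ...") have a single field.
        if len(row) >= 2 and row[1].isdigit():
            pids.append(row[1])
    return pids


# Hypothetical sample output, not captured from a real machine
sample = (
    '"my app.exe","4321","Console","1","10,240 K"\n'
    '"my app.exe","8765","Console","1","9,876 K"\n'
)
print(parse_tasklist_csv(sample))
```

To try it with the script above, the command string in list_process would need the extra /FO CSV /NH switches.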

Converting PDF to Word in parallel

import os
from configparser import ConfigParser
from io import StringIO
from io import open
from concurrent.futures import ProcessPoolExecutor

from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from docx import Document
#https://github.com/python-fan/pdf2word/blob/master/main.py

def read_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        resource_manager = PDFResourceManager()
        return_str = StringIO()
        lap_params = LAParams()

        device = TextConverter(
            resource_manager, return_str, laparams=lap_params)
        process_pdf(resource_manager, device, file)
        device.close()

        content = return_str.getvalue()
        return_str.close()
        return content


def save_text_to_word(content, file_path):
    doc = Document()
    for line in content.split('\n'):
        paragraph = doc.add_paragraph()
        paragraph.add_run(remove_control_characters(line))
    doc.save(file_path)


def remove_control_characters(content):
    mpa = dict.fromkeys(range(32))
    return content.translate(mpa)


def pdf_to_word(pdf_file_path, word_file_path):
    content = read_from_pdf(pdf_file_path)
    save_text_to_word(content, word_file_path)


def main():
    config_parser = ConfigParser()
    config_parser.read('config.cfg')
    config = config_parser['default']

    tasks = []
    with ProcessPoolExecutor(max_workers=int(config['max_worker'])) as executor:
        for file in os.listdir(config['pdf_folder']):
            extension_name = os.path.splitext(file)[1]
            if extension_name != '.pdf':
                continue
            file_name = os.path.splitext(file)[0]
            pdf_file = config['pdf_folder'] + '/' + file
            word_file = config['word_folder'] + '/' + file_name + '.docx'
            print('Processing: ', file)
            result = executor.submit(pdf_to_word, pdf_file, word_file)
            tasks.append(result)
    # Leaving the `with` block already waits for every task to finish;
    # result() re-raises any error that happened in a worker process.
    for task in tasks:
        task.result()
    print('Done')


if __name__ == '__main__':
    main()
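
The remove_control_characters helper above relies on str.translate: a mapping from a code point to None deletes that character. A minimal standalone sketch of the same idea follows; the constant name CONTROL_CHARS and the sample string are only illustrative.

```python
# Map every code point below 32 (control characters, including tab and newline) to None
CONTROL_CHARS = dict.fromkeys(range(32))


def remove_control_characters(content):
    # translate() drops every character whose code point maps to None
    return content.translate(CONTROL_CHARS)


# Hypothetical input containing a form feed and a tab
print(remove_control_characters("Page\x0c 1\tof 2"))
```

Note that tabs are removed as well, so words separated only by a tab end up joined.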

selenium, a browser automation testing tool

#http://jeffyang.top/Python/%E7%88%AC%E8%99%AB/Python%E7%88%AC%E8%99%AB%E5%B8%B8%E7%94%A8%E5%BA%93selenium%E8%AF%A6%E8%A7%A3/
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
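    # find_element(By.ID, ...) also works on Selenium 4, where find_element_by_id is gone
    search_box = browser.find_element(By.ID, 'kw')
    search_box.send_keys('Python')
    search_box.send_keys(Keys.ENTER)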
    wait = WebDriverWait(browser, 10)
    wait.until(EC.presence_of_element_located((By.ID, 'content_left')))
    print(browser.current_url)
    print(browser.get_cookies())
    print(browser.page_source)
finally:
    browser.close()

Fuzzy string matching

from fuzzywuzzy import process

courts = ['北京市第二中级人民法院','北京市第三中级人民法院','北京市石景山区人民法院']

print(process.extractOne('北京第三中院', courts)[0])
print(process.extractOne('石景山区法院', courts)[0])

Output (source: https://www.dust8.com/2019/04/20/fuzzywuzzy/)

北京市第三中级人民法院
北京市石景山区人民法院

Checking whether an image is corrupted

from PIL import Image


def is_valid_image(filename):
    valid = True
    try:
        Image.open(filename).load()
    except OSError:
        valid = False
    return valid

Converting Unicode escapes to Chinese text

str.encode('utf-8').decode('unicode-escape')
https://tangx1.com/solved/
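
A short sketch of how the encode/decode round trip behaves on a string that contains literal \uXXXX sequences. The function name unescape_unicode is made up for this example, and it assumes the input is otherwise plain ASCII.

```python
def unescape_unicode(text):
    """Turn literal \\uXXXX sequences in `text` into the characters they name."""
    # Assumes `text` is pure ASCII apart from the escapes; non-ASCII input
    # would be mangled because unicode-escape decodes the bytes as Latin-1.
    return text.encode('utf-8').decode('unicode-escape')


raw = r'\u4e2d\u6587'          # 12 literal characters: backslash, u, 4, e, ...
print(unescape_unicode(raw))   # prints the two characters U+4E2D U+6587
```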

Text recognition (OCR)

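//Uses the eng language data by default; imgName is the image path, result is the output file base name
tesseract imgName result
Specifying a language:

//Use Simplified Chinese
tesseract -l chi_sim imgName result

//List the language data installed locally
tesseract --list-langs
Specifying multiple languages:

//Join multiple languages with +
tesseract -l chi_sim+eng imgName result
Install the Python wrapper for Tesseract with pip:

pip install pytesseract
A simple Python example: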

import pytesseract
from PIL import Image

image = Image.open('/Users/admin/Desktop/test.jpg')
text = pytesseract.image_to_string(image)
print(text)
https://zhuanlan.zhihu.com/p/31530755

https://sourceforge.net/projects/tesseract-ocr/
https://github.com/tesseract-ocr/tesseract
https://github.com/UB-Mannheim/tesseract/wiki 
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.1.0-bibtag19.exe
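On Windows, Tesseract can be installed with the exe installer; the download link is in the wiki of the GitHub project. After installing, remember to add the directory containing the Tesseract executable to PATH so that later calls can find it.
img = Image.open("vm3.png")
text = pytesseract.image_to_string(img, lang='chi_sim')
print(text)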
https://segmentfault.com/a/1190000015489113

tesseract  520.png outfile
Tesseract Open Source OCR Engine v4.1.0-bibtag19 with Leptonica
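git clone https://github.com/tesseract-ocr/tessdata  (too slow)
https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata  (download this file directly instead)
wget -c -O chi_sim.traineddata "https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata?raw=true"
Put the downloaded language data into the tessdata directory under the installation directory. On Windows you also need to add the tessdata directory to the environment variables.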

tesseract test.png outfile -l chi_sim
Tesseract Open Source OCR Engine v4.1.0-bibtag19 with Leptonica
For English plus Chinese text, use -l eng+chi_sim
https://betacat.online/posts/2018-01-16/chinese-text-ocr-via-python/
http://qinghua.github.io/tesseract/
from PIL import Image
import pytesseract

class Languages:
    CHS = 'chi_sim'
    CHT = 'chi_tra'
    ENG = 'eng'

def img_to_str(image_path, lang=Languages.ENG):
    return pytesseract.image_to_string(Image.open(image_path), lang)
  
print(img_to_str('image/test1.png', lang=Languages.CHS))
print(img_to_str('image/test2.png', lang=Languages.CHS))
Although tesseract cannot process PDF files directly, ImageMagick and Ghostscript make it easy to convert a PDF into image files:
 
brew install imagemagick
brew install ghostscript
convert -density 100 -trim input.pdf output%04d.jpg
Online demo: https://gongpeione.github.io/quick-js-ocr/example/

pydub, an audio processing library

from pydub import AudioSegment
song = AudioSegment.from_wav("never_gonna_give_you_up.wav")
song = AudioSegment.from_mp3("never_gonna_give_you_up.mp3")
ogg_version = AudioSegment.from_ogg("never_gonna_give_you_up.ogg")
flv_version = AudioSegment.from_flv("never_gonna_give_you_up.flv")
mp4_version = AudioSegment.from_file("never_gonna_give_you_up.mp4", "mp4")
wma_version = AudioSegment.from_file("never_gonna_give_you_up.wma", "wma")
aac_version = AudioSegment.from_file("never_gonna_give_you_up.aiff", "aac")
# `awesome` stands for an AudioSegment built earlier, for example by combining clips
awesome.export("mashup.mp3", format="mp3")
awesome.export("mashup.mp3", format="mp3", tags={'artist': 'Various artists', 'album': 'Best of 2011', 'comments': 'This album is awesome!'})
awesome.export("mashup.mp3", format="mp3", bitrate="192k")
https://xin053.github.io/2016/11/05/pydub%E9%9F%B3%E9%A2%91%E5%A4%84%E7%90%86%E5%BA%93%E4%BD%BF%E7%94%A8%E8%AF%A6%E8%A7%A3/
Convert all mp4 and flv files in a directory to mp3:
import os
import glob
from pydub import AudioSegment
video_dir = '/home/johndoe/downloaded_videos/'  # Path where the videos are located
extension_list = ('*.mp4', '*.flv')
os.chdir(video_dir) # change the workplace
for extension in extension_list:
    for video in glob.glob(extension):
        mp3_filename = os.path.splitext(os.path.basename(video))[0] + '.mp3'
        AudioSegment.from_file(video).export(mp3_filename, format='mp3')

dataset, a simple database toolkit

import dataset
db = dataset.connect('sqlite:///:memory:')
db = dataset.connect('mysql://user:password@localhost/mydatabase')

table = db['sometable']
table.insert(dict(name='John Doe', age=37))
table.insert(dict(name='Jane Doe', age=34, gender='female'))
john = table.find_one(name='John Doe')
OrderedDict([('id', 1), ('name', 'John Doe'), ('age', 37), ('gender', None)])

table.update(dict(name='John Doe', age=47), ['name'])
The second argument plays the role of the WHERE clause in an SQL UPDATE statement: it selects the records to update.
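
To make the WHERE analogy concrete, here is a rough sketch with the standard-library sqlite3 module instead of dataset. The table layout mirrors the example above; the SQL is an approximation of what the update call expresses, not the statement dataset actually generates.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sometable (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)')
conn.execute("INSERT INTO sometable (name, age) VALUES ('John Doe', 37)")
conn.execute("INSERT INTO sometable (name, age) VALUES ('Jane Doe', 34)")

# Roughly what table.update(dict(name='John Doe', age=47), ['name']) expresses:
# the key column 'name' goes into WHERE, the remaining columns into SET.
conn.execute('UPDATE sometable SET age = ? WHERE name = ?', (47, 'John Doe'))

rows = conn.execute('SELECT name, age FROM sometable ORDER BY id').fetchall()
print(rows)
```

Only the row whose name matches is changed; the other row keeps its age.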
with dataset.connect() as tx:
    tx['user'].insert(dict(name='John Doe', age=46, country='China'))
Transactions can be handled simply with a context manager; if an exception occurs, the changes are rolled back.
db = dataset.connect()
db.begin()
try:
    db['user'].insert(dict(name='John Doe', age=46, country='China'))
    db.commit()
except:
    db.rollback()
>>> print(db)
<Database(sqlite:///mydatabase.db)>
>>> print(db.tables)
['user']
>>> print(db['user'].columns)
['id', 'country', 'name', 'age', 'gender']
>>> print(len(db['user']))
2
>>> table = db['user']
>>> table
<Table(user)>
>>> table.table
Table('user', MetaData(bind=Engine(sqlite:///mydatabase.db)), Column('id', INTEGER(), table=<user>, primary_key=True, nullable=False), Column('country', TEXT(), table=<user>), Column('name', TEXT(), table=<user>), Column('age', INTEGER(), table=<user>), Column('gender', TEXT(), table=<user>), schema=None)
db['user'].distinct('country')

table.delete(place='Berlin')
result = db.query('SELECT country, COUNT(*) c FROM user GROUP BY country')
for row in result:
    print(row['country'], row['c'])

# Note: from dataset 1.0 on, the export feature lives in the separate datafreeze package
result = db['users'].all()
dataset.freeze(result, format='json', filename='users.json')
https://xin053.github.io/2016/11/08/dataset%E7%AE%80%E6%98%93%E6%95%B0%E6%8D%AE%E5%BA%93%E5%8C%85%E4%BD%BF%E7%94%A8%E8%AF%A6%E8%A7%A3/

Tesseract image text recognition

brew install tesseract
https://github.com/tesseract-ocr/tesseract/wiki
tesseract paper.png paper -l chi_sim
tesseract paper.png paper -l chi_sim -c language_model_ngram_on=1
tesseract paper.png paper -l chi_sim tess.conf
https://tonydeng.github.io/2016/07/28/on-the-use-of-tesseract-picture-text-recognition/
	
tesseract -h
Usage:
  D:\Tesseract\tesseract.exe --help | --help-extra | --version
  D:\Tesseract\tesseract.exe --list-langs
  D:\Tesseract\tesseract.exe imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.
	
tesseract 1.jpg test -l chi_sim
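-l chi_sim tells tesseract to recognize the text with the trained Chinese language data; without -l it defaults to English. Join multiple languages with +.

After running the command above, a test.txt file containing the recognized text is created in the current directory (the desktop in this example).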
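
The command line described above can be assembled from Python as well. The helper below is only a sketch: build_tesseract_command is a hypothetical name, it merely builds the argument list and the expected output file name, and it does not call tesseract itself.

```python
def build_tesseract_command(image, output_base, languages=()):
    """Return (argv, output_file) for `tesseract image output_base [-l LANGS]`."""
    argv = ['tesseract', image, output_base]
    if languages:
        # Multiple languages are joined with "+", e.g. chi_sim+eng
        argv += ['-l', '+'.join(languages)]
    # For plain-text output tesseract appends .txt to the output base
    return argv, output_base + '.txt'


argv, output_file = build_tesseract_command('1.jpg', 'test', ['chi_sim', 'eng'])
print(argv)
print(output_file)
```

If tesseract is on PATH, the list could be passed to subprocess.run(argv, check=True) and the text read from output_file afterwards.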
from PIL import Image
import pytesseract
print(pytesseract.image_to_string(Image.open(r"C:\Users\zzx\Desktop\1.jpg")))
print(pytesseract.image_to_string(Image.open(r"C:\Users\zzx\Desktop\1.jpg"), lang='fra'))

def read_output(output_file_name):
    # Read the text file written by tesseract and strip surrounding whitespace
    with open(output_file_name) as f:
        return f.read().strip()
https://xin053.github.io/2016/10/28/Tesseract%E5%85%89%E5%AD%A6%E8%AF%86%E5%88%AB/

qrcode, a QR code generation library

pip install qrcode
qr "Some text" > test.png
import qrcode
img = qrcode.make('Some data here')
img.save(r"C:\Users\zzx\Desktop\test.jpg")
import qrcode
qr = qrcode.QRCode(
    version=1,
    error_correction=qrcode.constants.ERROR_CORRECT_L,
    box_size=10,
    border=4,
)
qr.add_data('Some data')
qr.make(fit=True)
img = qr.make_image()

https://xin053.github.io/2016/10/28/qrcode%E4%BA%8C%E7%BB%B4%E7%A0%81%E7%94%9F%E6%88%90%E5%BA%93%E4%BD%BF%E7%94%A8%E8%AF%A6%E8%A7%A3/

Merging videos with FFmpeg

ffmpeg -i "工程师的痛只有工程师能懂_高清-XNzE1NTk3Mzky_part1.flv" -c copy -bsf:v h264_mp4toannexb -f mpegts 1.ts
ffmpeg -i "工程师的痛只有工程师能懂_高清-XNzE1NTk3Mzky_part2.flv" -c copy -bsf:v h264_mp4toannexb -f mpegts 2.ts
ffmpeg -i "concat:1.ts|2.ts" -c copy -bsf:a aac_adtstoasc "工程师的痛只有工程师能懂.mp4"
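
The three commands follow a fixed pattern: remux every part into an MPEG-TS file, then join the .ts files with the concat protocol. A small sketch that generates the argument lists for any number of parts is shown below; build_concat_commands and the file names are hypothetical, and only the flags from the commands above are used.

```python
def build_concat_commands(parts, output):
    """Return the ffmpeg argv lists that remux each part to MPEG-TS and join them."""
    commands = []
    ts_files = []
    for index, part in enumerate(parts, start=1):
        ts_file = '%d.ts' % index
        ts_files.append(ts_file)
        # Step 1: copy the streams of each part into an MPEG-TS container
        commands.append(['ffmpeg', '-i', part, '-c', 'copy',
                         '-bsf:v', 'h264_mp4toannexb', '-f', 'mpegts', ts_file])
    # Step 2: join the .ts files with the concat protocol and write the final file
    commands.append(['ffmpeg', '-i', 'concat:' + '|'.join(ts_files), '-c', 'copy',
                     '-bsf:a', 'aac_adtstoasc', output])
    return commands


# Hypothetical file names
for argv in build_concat_commands(['part1.flv', 'part2.flv'], 'merged.mp4'):
    print(argv)
```

Each list can be handed to subprocess.run in order; passing arguments as a list avoids quoting problems with file names that contain spaces or non-ASCII characters.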

General-purpose extractor for news article text (GNE)


# Install with pip
pip install --upgrade git+https://github.com/kingname/GeneralNewsExtractor.git

# Install with pipenv
pipenv install git+https://github.com/kingname/GeneralNewsExtractor.git#egg=gne
Using GNE
>>> from gne import GeneralNewsExtractor

>>> html = '''rendered page HTML'''

>>> extractor = GeneralNewsExtractor()
>>> result = extractor.extract(html, noise_node_list=['//div[@class="comment-list"]'])
>>> print(result)

{"title": "xxxx", "publish_time": "2019-09-10 11:12:13", "author": "yyy", "content": "zzzz"}
To get the rendered HTML, open the page in Chrome, open the developer tools, locate the <html> tag in the Elements tab, right-click it, and choose Copy > Copy OuterHTML.

PyQt5-based movie search tool for Dianying Tiantang (电影天堂)

https://github.com/lt94/MovieHeavens
pip3 install pyqt5
python3 movies.py

https://mp.weixin.qq.com/s?__biz=MzA3Nzc4MzY2NA==&mid=2247485581&idx=1&sn=56825bf55a43c727b594d97ccbdbaff5&source=41#wechat_redirect

On Windows

# only python3 is supported
pip install pyinstaller
# -w must not be omitted; otherwise a console window appears while the program runs
pyinstaller -F -w ./movies.py ./movieSource/MovieHeaven.py ./movieSource/fake_user_agent.py

pyinstaller installation fails

pip install pyinstaller
fails with:
pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

Change the command to: pip install pyinstaller --timeout 1000
With the longer timeout the installation succeeded.

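Key notes from learning Python

Fun small Python crawler and data analysis projects

Recognizing text in images: Tesseract compared with Baidu Cloud OCR

Speech-to-text with Python

A Python library for crawling NetEase Cloud Music

Getting thousands of working proxies in two minutes

Common Python crawler tools: selenium, phantomjs, pyquery

NetEase Cloud Music helper

Getting started with the NumPy library

The pyquery crawler library in detail

The selenium crawler library in detail

Formatting two-dimensional Excel tables with Python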