如何用 Python 在不同的 Excel 中抓取資料並整合成一個 Excel
在上一篇用 Python 得出資料夾項下的各子資料夾 (opens new window),我們知道了如何用 os.walk
抓出資料夾底下所有檔案及路徑
這篇,我們一樣利用這個方法來找出資料夾中的所有 Excel,並從這些 Excel 中抓出我們想要的資料
最後,再將這些資料存成一個 Excel
這篇接續這個案例:使用 Python 的 Pandas 套件來統計打卡時間 (opens new window)
我們每天都會產出一個出勤報表,但若要找出一個員工在一段區間的報表,那不就要打開每一天的資料夾,並從中找出該員工每天的出勤紀錄呢?
有沒有辦法在多個 Excel 中,找出特定員工的出勤時間,並將這些資料儲存到 Excel 中呢?
下面這個程式就可以實現這點
# 抓取資料夾底下的每個 Excel 檔
首先,我們先用 os.walk
抓出資料夾 (C:\Users\009641\Desktop\Dairy Access Report) 底下所有檔案 (fileName),再加上檔案路徑 (folderPath)後,就可以得出檔案的完整路徑
如: C:\Users\009641\Desktop\Dairy Access Report\2020\1201-1208\Access Report_1208.xlsx
接著,因為每天的出勤報表都有相同的檔名開頭 (Access Report_),我們就可以用 if 'Access Report_' in fileName:
來只抓取需要的檔案路徑
利用 pd.read_excel
來讀取 Excel 內的檔案,再用 userInput
來抓取指定員工的紀錄
最後,再將這些記錄存到一開始建立的 dataframe cleanDf
即可
TIP
一開始在建立 dataframe 的時候,並不需要完整輸入每個欄位 (columns);其他欄位就算沒輸入也會在寫入時自動被補上 若要限制只顯示部分欄位的話,可在最後寫入資料表時指定
readPath = r'C:\Users\009641\Desktop\Dairy Access Report'
userInput = input("Please type the employee ID : ")
cleanDf = pd.DataFrame(columns=['ID', 'Date', 'Name', 'Swipe In', 'Swipe Out', 'Swipe Counts', 'VPN In', 'VPN Out', 'VPN Hours', 'Office Hours', 'Total Hours'])
for folderPath, folderNameList, fileNameList in os.walk(readPath):
# only file name
for fileName in fileNameList:
#print('i = ', i)
# Only get the file names contain Access Report_
if 'Access Report_' in fileName:
fileNamePath = folderPath + '\\' + fileName # C:\Users\009641\Desktop\Dairy Access Report\2020\1201-1208\Access Report_1208.xlsx
print('Reading file on', fileNamePath)
# Read the tab 'Summary with ADP'
df = pd.read_excel(fileNamePath, 'Summary')
# Convert user input from string to int
userInputInt = int(userInput)
# Add all selected data to clean dataframe
cleanDf = cleanDf.append(df[(df['ID'] == userInputInt)])
cleanDf['Total Working Hour'] = cleanDf['Total Work Time'].combine_first(cleanDf['Total Hours'])
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
# Source Code
import os
import re
import pandas as pd
readPath = r'C:\Users\009641\Desktop\Dairy Access Report'
userInput = input("Please type the employee ID : ")
cleanDf = pd.DataFrame(columns=['ID', 'Date', 'Name', 'Swipe In', 'Swipe Out', 'Swipe Counts', 'VPN In', 'VPN Out', 'VPN Hours', 'Office Hours', 'Total Hours'])
for folderPath, folderNameList, fileNameList in os.walk(readPath):
# only file name
for fileName in fileNameList:
# Only get the file names contain Access Report_
if 'Access Report_' in fileName:
fileNamePath = folderPath + '\\' + fileName # C:\Users\009641\Desktop\Dairy Access Report\2020\1201-1208\Access Report_1208.xlsx
print('Reading file on', fileNamePath)
# Read the tab 'Summary with ADP'
df = pd.read_excel(fileNamePath, 'Summary')
# Convert user input from string to int
userInputInt = int(userInput)
# Add all selected data to clean dataframe
cleanDf = cleanDf.append(df[(df['ID'] == userInputInt)])
# Rearrange the columns
cleanDf = cleanDf[['ID', 'Date', 'Name', 'Swipe In', 'Swipe Out', 'Swipe Counts', 'VPN In', 'VPN Out', 'VPN Hours', 'Office Hours', 'Total Hours']
with pd.ExcelWriter(r'C:\Users\009641\Desktop\Dairy Access Report\All Time Record_' + userInput + '.xlsx') as writer:
cleanDf.to_excel(writer, sheet_name = "All Time Record", index=False)
print("Save data to Excel")
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
歡迎點擊追蹤: