如何用 Python 在不同的 Excel 中抓取資料並整合成一個 Excel

6/23/2021 PythonPandasExcel

在上一篇用 Python 得出資料夾項下的各子資料夾 (opens new window),我們知道了如何用 os.walk 抓出資料夾底下所有檔案及路徑

這篇,我們一樣利用這個方法來找出資料夾中的所有 Excel,並從這些 Excel 中抓出我們想要的資料

最後,再將這些資料存成一個 Excel


這篇接續這個案例:使用 Python 的 Pandas 套件來統計打卡時間 (opens new window)

我們每天都會產出一個出勤報表,但若要找出一個員工在一段區間的報表,那不就要打開每一天的資料夾,並從中找出該員工每天的出勤紀錄呢?

有沒有辦法在多個 Excel 中,找出特定員工的出勤時間,並將這些資料儲存到 Excel 中呢?

下面這個程式就可以實現這點

# 抓取資料夾底下的每個 Excel 檔

首先,我們先用 os.walk 抓出資料夾 (C:\Users\009641\Desktop\Dairy Access Report) 底下所有檔案 (fileName),再加上檔案路徑 (folderPath)後,就可以得出檔案的完整路徑

如: C:\Users\009641\Desktop\Dairy Access Report\2020\1201-1208\Access Report_1208.xlsx

91-1

接著,因為每天的出勤報表都有相同的檔名開頭 (Access Report_),我們就可以用 if 'Access Report_' in fileName: 來只抓取需要的檔案路徑

利用 pd.read_excel 來讀取 Excel 內的檔案,再用 userInput 來抓取指定員工的紀錄

最後,再將這些記錄存到一開始建立的 dataframe cleanDf 即可

TIP

一開始在建立 dataframe 的時候,並不需要完整輸入每個欄位 (columns);其他欄位就算沒輸入也會在寫入時自動被補上 若要限制只顯示部分欄位的話,可在最後寫入資料表時指定

91-2

readPath = r'C:\Users\009641\Desktop\Dairy Access Report'
userInput = input("Please type the employee ID : ")

cleanDf = pd.DataFrame(columns=['ID', 'Date', 'Name', 'Swipe In', 'Swipe Out', 'Swipe Counts', 'VPN In', 'VPN Out', 'VPN Hours', 'Office Hours', 'Total Hours'])

for folderPath, folderNameList, fileNameList in os.walk(readPath):
    # only file name
    for fileName in fileNameList:
        #print('i = ', i)
        # Only get the file names contain Access Report_
        if 'Access Report_' in fileName:
            fileNamePath = folderPath + '\\' + fileName  # C:\Users\009641\Desktop\Dairy Access Report\2020\1201-1208\Access Report_1208.xlsx
            print('Reading file on', fileNamePath)

            # Read the tab 'Summary with ADP'
            df = pd.read_excel(fileNamePath, 'Summary')

            # Convert user input from string to int
            userInputInt = int(userInput)

            # Add all selected data to clean dataframe
            cleanDf = cleanDf.append(df[(df['ID'] == userInputInt)])


cleanDf['Total Working Hour'] = cleanDf['Total Work Time'].combine_first(cleanDf['Total Hours'])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

# Source Code

import os
import re
import pandas as pd

readPath = r'C:\Users\009641\Desktop\Dairy Access Report'
userInput = input("Please type the employee ID : ")

cleanDf = pd.DataFrame(columns=['ID', 'Date', 'Name', 'Swipe In', 'Swipe Out', 'Swipe Counts', 'VPN In', 'VPN Out', 'VPN Hours', 'Office Hours', 'Total Hours'])

for folderPath, folderNameList, fileNameList in os.walk(readPath):
    # only file name
    for fileName in fileNameList:
        # Only get the file names contain Access Report_
        if 'Access Report_' in fileName:
            fileNamePath = folderPath + '\\' + fileName  # C:\Users\009641\Desktop\Dairy Access Report\2020\1201-1208\Access Report_1208.xlsx
            print('Reading file on', fileNamePath)

            # Read the tab 'Summary with ADP'
            df = pd.read_excel(fileNamePath, 'Summary')

            # Convert user input from string to int
            userInputInt = int(userInput)

            # Add all selected data to clean dataframe
            cleanDf = cleanDf.append(df[(df['ID'] == userInputInt)])


# Rearrange the columns
cleanDf = cleanDf[['ID', 'Date', 'Name', 'Swipe In', 'Swipe Out', 'Swipe Counts', 'VPN In', 'VPN Out', 'VPN Hours', 'Office Hours', 'Total Hours']

with pd.ExcelWriter(r'C:\Users\009641\Desktop\Dairy Access Report\All Time Record_' + userInput + '.xlsx') as writer:
    cleanDf.to_excel(writer, sheet_name = "All Time Record", index=False)

print("Save data to Excel")
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
Last Updated: 6/30/2023, 2:59:11 PM

歡迎點擊追蹤:

(adsbygoogle = window.adsbygoogle || []).push({});