A Quick Guide to Basic Data Import Techniques in Python

學.誌|Chris Kang
13 min read · Nov 25, 2019

Introduction and flat files

Importing entire text files

# Open a file: file
file = open('moby_dick.txt', mode='r')
# Print it
print(file.read())
# Check whether file is closed
print(file.closed)
# Close file
file.close()
# Check whether file is closed
print(file.closed)
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

Using NumPy to import flat files

# Import package
import numpy as np
# Assign filename to variable: file
file = 'digits.csv'
# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')
# Print datatype of digits
print(type(digits))
# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))
# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()
# Import numpy
import numpy as np
# Assign the filename: file
file = 'digits_header.txt'
# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])
# Print data
print(data)

Importing different data types

# Assign filename: file
file = 'seaslug.txt'
# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)
# Print the first element of data
print(data[0])
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
# Print the 10th element of data_float
print(data_float[9])
# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)

Setting `dtype=None` tells `genfromtxt` to infer each column's type, which is how you read files that mix data types (here, the Titanic dataset's strings and numbers).
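As a minimal sketch of that behavior (with a small in-memory buffer standing in for a mixed-type CSV file), `dtype=None` plus `names=True` gives a structured array whose columns keep their own types and can be accessed by name:

```python
import io
import numpy as np

# A tiny in-memory stand-in for a mixed-type CSV file
csv_text = "name,age,fare\nBraund,22,7.25\nCumings,38,71.28\n"

# dtype=None asks genfromtxt to infer each column's type;
# names=True turns the header row into field names
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                     names=True, dtype=None, encoding='utf-8')

print(data.dtype.names)  # column names from the header
print(data['age'])       # numeric column, accessed by name
```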

# Assign the filename: file
file = 'titanic.csv'
# Import file using np.recfromcsv: d
d = np.recfromcsv(file)  # behaves much like genfromtxt with dtype=None, delimiter=',' and names=True; note recfromcsv has since been deprecated in NumPy
# Print out first three entries of d
print(d[:3])

Using pandas to import flat files

# Import pandas as pd
import pandas as pd
# Assign the filename: file
file = 'titanic.csv'
# Read the file into a DataFrame: df
df = pd.read_csv(file)
# View the head of the DataFrame
print(df.head())
# Assign the filename: file
file = 'digits.csv'
# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, header=None, nrows=5)
# Build a numpy array from the DataFrame: data_array
data_array = data.values
# Print the datatype of data_array to the shell
print(type(data_array))
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

Importing API and JSONs

API (Application Programming Interface): the interface through which applications talk to each other.

Importing JSON

import json

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])

First import the json module, then use Python's built-in with statement to read the file and store the parsed data in a variable via json.load().

To list the keys, call the json_data.keys() method; to retrieve a value, index the dictionary with json_data[key].
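A self-contained round trip shows the same key/value access pattern without needing a file on disk (json.loads parses a string, where json.load reads from a file object):

```python
import json

# json.loads parses a JSON string; json.load does the same for a file object
json_data = json.loads('{"Title": "Moby Dick", "Year": "1851"}')

# Iterate over the keys and look up each value by key
for k in json_data.keys():
    print(k + ': ', json_data[k])
```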

Importing APIs (Application Program Interfaces)

What is an API? It is a set of protocols and routines for building and interacting with software applications.

# Import requests package
import requests
# Assign URL to variable: url
url = 'http://www.omdbapi.com?apikey=72bc447a&t=the+social+network'
# Everything after the '?' is the query string (e.g. ?t=tracker); the API's documentation describes the available query parameters
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Print the text of the response
print(r.text)
# Decode the JSON data into a dictionary: json_data
json_data = r.json()
# Print each key-value pair in json_data
for k in json_data.keys():
print(k + ': ', json_data[k])
# Extract nested data from a JSON response (this line is from a separate
# Wikipedia API exercise; these keys don't exist in the OMDb response)
pizza_extract = json_data['query']['pages']['24768']['extract']

The query-string concept:

?t=tracker
  • Everything after the question mark is the query string
  • In this example, it asks the API to return the entry whose title (t) matches
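The query string can also be built programmatically instead of concatenated by hand; a sketch using the standard library, with parameter names taken from the OMDb example above:

```python
from urllib.parse import urlencode

base_url = 'http://www.omdbapi.com'

# Each key=value pair becomes one component of the query string
params = {'apikey': '72bc447a', 't': 'the social network'}

# urlencode joins the pairs with '&' and encodes spaces as '+'
url = base_url + '?' + urlencode(params)
print(url)
```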

The Twitter API and Authentication

REST API: 是指能夠讀寫 Twitter 上面的資料。再利用 OAuth 來進行授權認證確認使用者身份。如果要使資料為 real-time 則可參考 streaming API。

PS:可以查查什麼是 RESTful API?

# Import package
import tweepy
# Store OAuth authentication credentials in relevant variables
access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"
# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
  1. First, import tweepy, the Twitter-specific API package.
  2. Store the tokens from Twitter in their respective variables for OAuth authentication; there are four credentials in total.
  3. Pass the consumer credentials to OAuthHandler: the key first, then the secret (so many secrets and keys!).
  4. Then pass the access key and secret in via the set_access_token method.
# Initialize Stream listener (MyStreamListener is a tweepy.StreamListener
# subclass defined earlier in the course)
l = MyStreamListener()
# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)
# Filter Twitter Streams to capture data by the keywords:
stream.filter(track=['clinton','trump', 'sanders', 'cruz'])
  1. Next, start the listener and pass in the authentication tokens.
  2. Assign the resulting stream object to the variable stream.
  3. Use the filter method to capture only tweets containing specific keywords.
# Import package
import json
# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'
# Initialize empty list to store tweets: tweets_data
tweets_data = []
# Open connection to file
tweets_file = open(tweets_data_path, "r")
# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)
# Close connection to file
tweets_file.close()
# Print the keys of the first tweet dict
print(tweets_data[0].keys())
  1. DataCamp has already saved the data for you (a black box?); use Python's built-in open to read the file and append each parsed tweet to a list in a for loop.
  2. Remember to close the file afterwards to free the resources.
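The same loop can be written with a `with` block, which closes the file automatically even if an exception occurs; here an in-memory buffer stands in for tweets.txt:

```python
import io
import json

# Stand-in for tweets.txt: one JSON document per line
fake_file = io.StringIO('{"text": "hello", "lang": "en"}\n'
                        '{"text": "bonjour", "lang": "fr"}\n')

tweets_data = []

# 'with' closes the file automatically when the block ends
with fake_file as tweets_file:
    for line in tweets_file:
        tweets_data.append(json.loads(line))

print(tweets_data[0].keys())
```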
# Import package
import pandas as pd
# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])
# Print head of DataFrame
print(df.head())

This step reads the selected columns out of the tweet data and stores them in the DataFrame df with pandas.

import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)

    if match:
        return True
    return False

# Initialize list to store tweet counts
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]
# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])

Use a regular expression to match each candidate's name and tally the mentions in separate counters.
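One caveat: a bare `re.search(word, text)` also matches substrings, so 'trump' would count 'trumpet' as a mention. A stricter sketch adds `\b` word boundaries (and `re.escape` in case the word contains regex metacharacters):

```python
import re

def word_in_text(word, text):
    # \b anchors the pattern to word boundaries, so 'trump'
    # no longer matches inside 'trumpet'
    pattern = r'\b' + re.escape(word.lower()) + r'\b'
    return bool(re.search(pattern, text.lower()))

print(word_in_text('trump', 'Trump rally tonight'))    # True
print(word_in_text('trump', 'trumpet solo at eight'))  # False
```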

# Import packages
import matplotlib.pyplot as plt
import seaborn as sns
# Set seaborn style
sns.set(color_codes=True)
# Create a list of labels:cd
cd = ['clinton', 'trump', 'sanders', 'cruz']
# Plot the bar chart (newer seaborn versions require keyword arguments)
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()

Finally, plotting the data shows that trump is mentioned noticeably more often than the other candidates.

Thank you for reading my article to the end.

If you enjoy my articles on Medium, please give this one some claps (you can clap up to five times!). You are also welcome to share it with friends who might find it useful.
