A Quick Look at Basic Techniques for Importing Data in Python
Introduction and flat files
Importing entire text files
# Open a file: file
file = open('moby_dick.txt', mode='r')
# Print it
print(file.read())
# Check whether file is closed
print(file.closed)
# Close file
file.close()
# Check whether file is closed
print(file.closed)

# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
Using NumPy to import flat files
# Import packages
import numpy as np
import matplotlib.pyplot as plt
# Assign filename to variable: file
file = 'digits.csv'
# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')
# Print datatype of digits
print(type(digits))
# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))
# Plot reshaped data
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

# Import numpy
import numpy as np
# Assign the filename: file
file = 'digits_header.txt'
# Load the data, skipping the header row and keeping columns 0 and 2: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 2])
# Print data
print(data)
Importing different data types
# Assign filename: file
file = 'seaslug.txt'
# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)
# Print the first element of data
print(data[0])
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
# Print the 10th element of data_float
print(data_float[9])
# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()

data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
Passing dtype=None tells genfromtxt to infer a type for each column, which is what lets it read a file whose columns hold different data types.
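To make the effect of dtype=None concrete, here is a minimal sketch that reads a tiny mixed-type CSV from memory. The sample text and its column names are made up for illustration (it is not the real titanic.csv), and encoding='utf-8' is passed explicitly so that text columns come back as ordinary strings.

```python
import io
import numpy as np

# Hypothetical stand-in for a small mixed-type CSV (not the real titanic.csv)
csv_text = "name,age,fare\nAlice,29,7.25\nBob,41,71.28\n"

# names=True takes the column names from the first row;
# dtype=None asks NumPy to infer a separate type for each column
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                     names=True, dtype=None, encoding='utf-8')

# The result is a structured array: columns are accessed by name
print(data.dtype.names)
print(data['age'])
print(data['fare'].mean())
```

Because each column keeps its own type, numeric columns such as age can be used in calculations right away while name stays as text.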
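# Assign the filename: file
file = 'titanic.csv'
# Import file using np.recfromcsv: d
# (behaves much like genfromtxt with delimiter=',', names=True, dtype=None;
# recent NumPy releases may no longer provide recfromcsv, in which case
# that genfromtxt call is the closest equivalent)
d = np.recfromcsv(file)
# Print out first three entries of d
print(d[:3])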
Using pandas to import flat files
# Import pandas as pd
import pandas as pd
# Assign the filename: file
file = 'titanic.csv'
# Read the file into a DataFrame: df
df = pd.read_csv(file)
# View the head of the DataFrame
print(df.head())

# Assign the filename: file
file = 'digits.csv'
# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, header=None, nrows=5)
# Build a numpy array from the DataFrame: data_array
data_array = data.values
# Print the datatype of data_array to the shell
print(type(data_array))

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
Importing APIs and JSON
API stands for Application Programming Interface.
Importing JSON
import json
# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)
# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
First import the json module, then open the file in a with block and use json.load() to store its contents in a variable.
To list the keys, call json_data.keys(); to get the value that belongs to a key, index the dictionary with json_data[key].
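The same steps can be tried without the course's a_movie.json. The sketch below writes a small made-up dictionary to a temporary file first, so the file name and its contents are assumptions for illustration only.

```python
import json
import os
import tempfile

# Hypothetical movie record, used only so the example has a file to read
movie = {"Title": "Example Movie", "Year": "2010"}

# Write the record to a temporary JSON file
path = os.path.join(tempfile.mkdtemp(), "a_movie_demo.json")
with open(path, "w") as json_file:
    json.dump(movie, json_file)

# Load it back: json.load() turns the JSON object into a Python dict
with open(path) as json_file:
    json_data = json.load(json_file)

# Iterate over key-value pairs; items() avoids a second lookup per key
for k, v in json_data.items():
    print(k + ': ', v)
```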
Importing APIs (Application Programming Interfaces)
What is an API? It is a set of protocols and interfaces that lets one program talk to another.
# Import requests package
import requests
# Assign URL to variable: url
url = 'http://www.omdbapi.com?apikey=72bc447a&t=the+social+network'
# Everything after the "?" is the query string (for example ?t=tracker);
# the API's documentation usually describes which parameters it accepts
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Print the text of the response
print(r.text)
# Decode the JSON data into a dictionary: json_data
json_data = r.json()
# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])

# Extract the data from the JSON
# (this line belongs to a different request: it assumes json_data holds a response
# with query -> pages -> '24768' -> extract keys, which the response above does not have)
pizza_extract = json_data['query']['pages']['24768']['extract']
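The idea behind the query string:
?t=tracker
- Everything after the question mark is the query string.
- In this example, t is the title parameter: the API returns the record for that title.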
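A query string does not have to be typed by hand. As a small sketch using only the standard library, the snippet below builds one from a dictionary and then parses it back; the apikey value is a placeholder, not a working key.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = 'http://www.omdbapi.com'

# Parameters as a dict; 'YOUR_API_KEY' is a placeholder value
params = {'apikey': 'YOUR_API_KEY', 't': 'the social network'}

# urlencode joins the pairs with '&' and encodes spaces as '+'
url = base_url + '?' + urlencode(params)
print(url)

# Going the other way: split the URL and parse the part after '?'
query = parse_qs(urlsplit(url).query)
print(query['t'])
```

requests.get() also accepts a params argument that performs the same encoding, which keeps the base URL and the parameters separate in your code.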
The Twitter API and Authentication
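REST API: lets you read and write data on Twitter. OAuth then handles authorization and confirms the user's identity. If you need the data in real time, look at the Streaming API instead.
PS: it is worth looking up what a RESTful API is.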
# Import package
import tweepy
# Store OAuth authentication credentials in relevant variables
access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"
# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
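- First, import tweepy, a package built specifically for the Twitter API.
- Store the tokens obtained from Twitter in their own variables for OAuth authentication; there are four credentials in total.
- Next, pass the consumer pair to OAuthHandler: the key first, the secret second (so many secrets and keys?).
- Then call the set_access_token method with the access token and its secret.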
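Since all four credentials are secrets, one common alternative to writing them into the script is to read them from environment variables. The sketch below is only an illustration of that idea: the variable names and the load_credentials helper are hypothetical, not part of tweepy or the course.

```python
import os

# Hypothetical environment variable names for the four OAuth values
REQUIRED = ("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
            "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_TOKEN_SECRET")

def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}

# Demo with placeholder values in a plain dict instead of the real environment
demo_env = {name: "placeholder" for name in REQUIRED}
creds = load_credentials(demo_env)
print(sorted(creds))
```

The consumer pair would then go to OAuthHandler and the access pair to set_access_token, in the same order as the steps above.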
# Initialize Stream listener
# (MyStreamListener is not defined in this article; it is assumed to be supplied by the course exercise)
l = MyStreamListener()
# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)
# Filter Twitter Streams to capture data by the keywords:
stream.filter(track=['clinton', 'trump', 'sanders', 'cruz'])
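- Next, start the listener and pass in the authentication object.
- The resulting stream object is assigned to the variable stream.
- Use the filter method to keep only tweets that contain the given keywords.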
# Import package
import json
# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'
# Initialize empty list to store tweets: tweets_data
tweets_data = []
# Open connection to file
tweets_file = open(tweets_data_path, "r")
# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)
# Close connection to file
tweets_file.close()
# Print the keys of the first tweet dict
print(tweets_data[0].keys())
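- DataCamp then saves the data for you behind the scenes (a black box?), and Python's built-in open reads the file so that a for loop can append each tweet to a list.
- Remember to close the file afterwards so that its file handle is released.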
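A with block closes the file automatically, so the explicit close() step cannot be forgotten. Here is a minimal sketch of the same line-by-line loading; the two sample tweets are invented and written to a temporary file so the example runs without tweets.txt.

```python
import json
import os
import tempfile

# Invented sample data: one JSON object per line, like the tweets file
sample = [{"text": "hello", "lang": "en"}, {"text": "hola", "lang": "es"}]
path = os.path.join(tempfile.mkdtemp(), "tweets_demo.txt")
with open(path, "w") as f:
    for tweet in sample:
        f.write(json.dumps(tweet) + "\n")

# Read the file back line by line; json.loads parses one line at a time
tweets_data = []
with open(path) as tweets_file:
    for line in tweets_file:
        line = line.strip()
        if line:  # skip blank lines
            tweets_data.append(json.loads(line))

print(tweets_data[0].keys())
# The file is already closed once the with block ends
print(tweets_file.closed)
```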
# Import package
import pandas as pd
# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])
# Print head of DataFrame
print(df.head())
This step picks out specific fields and stores them with pandas in the variable df.
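To see what the columns argument does, here is a short sketch with two invented tweet dictionaries that carry an extra id field. Only the listed columns end up in the DataFrame.

```python
import pandas as pd

# Invented records standing in for parsed tweets; 'id' is an extra field
tweets_data = [
    {"text": "hello", "lang": "en", "id": 1},
    {"text": "hola", "lang": "es", "id": 2},
]

# columns= keeps only the named keys, in the given order
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])

print(df.columns.tolist())
print(df.shape)
```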
import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False
# Initialize list to store tweet counts
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]
# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])
A regular expression checks whether each tweet mentions a given name, and four counters keep the tally for the four candidates.
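The counting step can also be written with a dictionary instead of four separate variables. The sketch below is a variation, not the original function: it adds word boundaries and re.escape, so "trumpet" does not count as a mention of "trump", and the sample texts are invented.

```python
import re

def word_in_text(word, text):
    """Return True if word appears in text as a whole word, ignoring case."""
    # re.escape guards against special characters in the search word;
    # \b restricts the match to whole words (a change from plain substring search)
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Invented sample texts standing in for the 'text' column
texts = [
    "Trump and Clinton debate tonight",
    "I voted for Sanders",
    "trumpet lessons",
]

names = ['clinton', 'trump', 'sanders', 'cruz']

# True counts as 1 when summed, so this tallies the matching texts per name
counts = {name: sum(word_in_text(name, t) for t in texts) for name in names}
print(counts)
```

Whether whole-word matching is what you want depends on the data; the original substring search would also catch hashtags such as #trump2016, which this version skips.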
# Import packages
import matplotlib.pyplot as plt
import seaborn as sns
# Set seaborn style
sns.set(color_codes=True)
# Create a list of labels: cd
cd = ['clinton', 'trump', 'sanders', 'cruz']
# Plot the bar chart (x and y passed by keyword, which newer seaborn versions require)
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()
Finally, plot the data: trump is mentioned noticeably more often than the other names.