A Quick Guide to Basic Data Import Techniques in Python

學.誌|Chris Kang
13 min read · Nov 25, 2019

Introduction and flat files

Importing entire text files

# Open a file: file
file = open('moby_dick.txt', mode='r')
# Print it
print(file.read())
# Check whether file is closed
print(file.closed)
# Close file
file.close()
# Check whether file is closed
print(file.closed)
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

Using NumPy to import flat files

# Import package
import numpy as np
# Assign filename to variable: file
file = 'digits.csv'
# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')
# Print datatype of digits
print(type(digits))
# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))
# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()
# Import numpy
import numpy as np
# Assign the filename: file
file = 'digits_header.txt'
# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])
# Print data
print(data)

Importing different data types

# Assign filename: file
file = 'seaslug.txt'
# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)
# Print the first element of data
print(data[0])
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
# Print the 10th element of data_float
print(data_float[9])
# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)

Setting `dtype=None` tells `genfromtxt` to infer each column's type, which is how you read files that mix data types (here, the Titanic dataset's strings and numbers).
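As a minimal sketch of that behavior (with a small in-memory buffer standing in for a mixed-type CSV file), `dtype=None` plus `names=True` gives a structured array whose columns keep their own types and can be accessed by name:

```python
import io
import numpy as np

# A tiny in-memory stand-in for a mixed-type CSV file
csv_text = "name,age,fare\nBraund,22,7.25\nCumings,38,71.28\n"

# dtype=None asks genfromtxt to infer each column's type;
# names=True turns the header row into field names
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                     names=True, dtype=None, encoding='utf-8')

print(data.dtype.names)  # column names from the header
print(data['age'])       # numeric column, accessed by name
```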

# Assign the filename: file
file = 'titanic.csv'
# Import file using np.recfromcsv: d
d = np.recfromcsv(file)  # behaves much like genfromtxt with dtype=None, delimiter=',' and names=True; note recfromcsv has since been deprecated in NumPy
# Print out first three entries of d
print(d[:3])

Using pandas to import flat files

# Import pandas as pd
import pandas as pd
# Assign the filename: file
file = 'titanic.csv'
# Read the file into a DataFrame: df
df = pd.read_csv(file)
# View the head of the DataFrame
print(df.head())
# Assign the filename: file
file = 'digits.csv'
# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, header=None, nrows=5)
# Build a numpy array from the DataFrame: data_array
data_array = data.values
# Print the datatype of data_array to the shell
print(type(data_array))
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

Importing API and JSONs

API (Application Programming Interface): the interface through which applications talk to each other.

Importing JSON

import json

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])

First import the json module, then use Python's built-in with statement to read the file and store the parsed data in a variable via json.load().

To list the keys, call the json_data.keys() method; to retrieve a value, index the dictionary with json_data[key].
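A self-contained round trip shows the same key/value access pattern without needing a file on disk (json.loads parses a string, where json.load reads from a file object):

```python
import json

# json.loads parses a JSON string; json.load does the same for a file object
json_data = json.loads('{"Title": "Moby Dick", "Year": "1851"}')

# Iterate over the keys and look up each value by key
for k in json_data.keys():
    print(k + ': ', json_data[k])
```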

Importing APIs (Application Program Interfaces)

What is an API? It is a set of protocols and routines for building and interacting with software applications.

# Import requests package
import requests
# Assign URL to variable: url
url = 'http://www.omdbapi.com?apikey=72bc447a&t=the+social+network'
# Everything after the '?' is the query string (e.g. ?t=tracker); the API's documentation describes the available query parameters
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Print the text of the response
print(r.text)
# Decode the JSON data into a dictionary: json_data
json_data = r.json()
# Print each key-value pair in json_data
for k in json_data.keys():
print(k + ': ', json_data[k])
# Extract nested data from a JSON response (this line is from a separate
# Wikipedia API exercise; these keys don't exist in the OMDb response)
pizza_extract = json_data['query']['pages']['24768']['extract']

The query-string concept:

?t=tracker
  • Everything after the question mark is the query string
  • In this example, it asks the API to return the entry whose title (t) matches
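The query string can also be built programmatically instead of concatenated by hand; a sketch using the standard library, with parameter names taken from the OMDb example above:

```python
from urllib.parse import urlencode

base_url = 'http://www.omdbapi.com'

# Each key=value pair becomes one component of the query string
params = {'apikey': '72bc447a', 't': 'the social network'}

# urlencode joins the pairs with '&' and encodes spaces as '+'
url = base_url + '?' + urlencode(params)
print(url)
```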

The Twitter API and Authentication

REST API: 是指能夠讀寫 Twitter 上面的資料。再利用 OAuth 來進行授權認證確認使用者身份。如果要使資料為 real-time 則可參考 streaming API。

PS:可以查查什麼是 RESTful API?

# Import package
import tweepy
# Store OAuth authentication credentials in relevant variables
access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"
# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
  1. First, import tweepy, the Twitter-specific API package.
  2. Store the tokens from Twitter in their respective variables for OAuth authentication; there are four credentials in total.
  3. Pass the consumer credentials to OAuthHandler: the key first, then the secret (so many secrets and keys!).
  4. Then pass the access key and secret in via the set_access_token method.
# Initialize Stream listener (MyStreamListener is a tweepy.StreamListener
# subclass defined earlier in the course)
l = MyStreamListener()
# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)
# Filter Twitter Streams to capture data by the keywords:
stream.filter(track=['clinton','trump', 'sanders', 'cruz'])
  1. Next, start the listener and pass in the authentication tokens.
  2. Assign the resulting stream object to the variable stream.
  3. Use the filter method to capture only tweets containing specific keywords.
# Import package
import json
# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'
# Initialize empty list to store tweets: tweets_data
tweets_data = []
# Open connection to file
tweets_file = open(tweets_data_path, "r")
# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)
# Close connection to file
tweets_file.close()
# Print the keys of the first tweet dict
print(tweets_data[0].keys())
  1. DataCamp has already saved the data for you (a black box?); use Python's built-in open to read the file and append each parsed tweet to a list in a for loop.
  2. Remember to close the file afterwards to free the resources.
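The same loop can be written with a `with` block, which closes the file automatically even if an exception occurs; here an in-memory buffer stands in for tweets.txt:

```python
import io
import json

# Stand-in for tweets.txt: one JSON document per line
fake_file = io.StringIO('{"text": "hello", "lang": "en"}\n'
                        '{"text": "bonjour", "lang": "fr"}\n')

tweets_data = []

# 'with' closes the file automatically when the block ends
with fake_file as tweets_file:
    for line in tweets_file:
        tweets_data.append(json.loads(line))

print(tweets_data[0].keys())
```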
# Import package
import pandas as pd
# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])
# Print head of DataFrame
print(df.head())

This step reads the selected columns out of the tweet data and stores them in the DataFrame df with pandas.

import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)

    if match:
        return True
    return False

# Initialize list to store tweet counts
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]
# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])

Use a regular expression to match each candidate's name and tally the mentions in separate counters.
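One caveat: a bare `re.search(word, text)` also matches substrings, so 'trump' would count 'trumpet' as a mention. A stricter sketch adds `\b` word boundaries (and `re.escape` in case the word contains regex metacharacters):

```python
import re

def word_in_text(word, text):
    # \b anchors the pattern to word boundaries, so 'trump'
    # no longer matches inside 'trumpet'
    pattern = r'\b' + re.escape(word.lower()) + r'\b'
    return bool(re.search(pattern, text.lower()))

print(word_in_text('trump', 'Trump rally tonight'))    # True
print(word_in_text('trump', 'trumpet solo at eight'))  # False
```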

# Import packages
import matplotlib.pyplot as plt
import seaborn as sns
# Set seaborn style
sns.set(color_codes=True)
# Create a list of labels:cd
cd = ['clinton', 'trump', 'sanders', 'cruz']
# Plot the bar chart (newer seaborn versions require keyword arguments)
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()

Finally, plotting the data shows that trump is mentioned noticeably more often than the other candidates.

Thank you for reading my article to the end.

If you enjoy my articles on Medium, please give this one some claps (you can clap up to five times!). You are also welcome to share it with friends who might find it useful.
