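Great. For local research work in a PyCharm development environment, I recommend Prefect 2.x for managing the data pipeline of your Chinese-patent LDA project. It is lightweight, Python-native, and easy to debug, which suits your scenario well.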

✅ Using Prefect for local research (PyCharm-friendly)

📦 1. Install Prefect

pip install "prefect>=2,<3"

🧱 2. Recommended project layout

lda_pipeline/
├── pipeline.py          # Prefect workflow definition
├── data/
│   ├── raw/             # raw patent abstracts
│   └── processed/       # tokenized / cleaned output
├── modules/
│   ├── fetch.py         # data loading module
│   ├── preprocess.py    # cleaning + tokenization module
│   ├── lda_model.py     # LDA training and saving
│   └── visualize.py     # visualization module
└── config/
    └── stopwords.txt

🚀 3. Example workflow definition (pipeline.py)

from prefect import flow, task

from modules import fetch, preprocess, lda_model, visualize


@task
def load_data():
    # Read the raw patent abstracts from disk
    return fetch.load_raw_data("data/raw/patents.json")


@task
def clean_and_tokenize(texts):
    # Segment each abstract with jieba and drop stopwords
    return preprocess.clean_and_segment(texts, stopword_path="config/stopwords.txt")


@task
def train_lda_model(tokenized_texts):
    # Fit a 10-topic LDA model on the tokenized abstracts
    return lda_model.train(tokenized_texts, num_topics=10)


@task
def show_topics(model):
    # Draw one word cloud per topic
    visualize.show_wordclouds(model)


@flow
def lda_pipeline():
    data = load_data()
    tokens = clean_and_tokenize(data)
    lda_model_obj = train_lda_model(tokens)
    show_topics(lda_model_obj)


if __name__ == "__main__":
    lda_pipeline()

After you run this script, the whole LDA pipeline executes automatically in task order, and each stage can be debugged on its own.

📚 4. Suggested contents for each module

modules/fetch.py

import json


def load_raw_data(path):
    # Expects a JSON array of objects, each with an "abstract" field
    with open(path, 'r', encoding='utf-8') as f:
        return [entry["abstract"] for entry in json.load(f)]
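load_raw_data assumes that patents.json is a JSON array of objects, each carrying an "abstract" key. The sketch below illustrates that assumed shape with made-up records, together with a hypothetical variant, load_abstracts, that skips entries without a usable abstract instead of raising KeyError. The sample data and the helper name are illustrations only and are not part of the module above.

```python
import json
import tempfile
from pathlib import Path


def load_abstracts(path):
    """Return the non-empty "abstract" strings from a JSON array of records."""
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    abstracts = []
    for entry in records:
        # Tolerate records that are not objects or that lack the field
        text = entry.get("abstract") if isinstance(entry, dict) else None
        if isinstance(text, str) and text.strip():
            abstracts.append(text.strip())
    return abstracts


# Made-up records showing the assumed file layout
sample = [
    {"id": "CN-0001", "abstract": "一种锂电池正极材料的制备方法"},
    {"id": "CN-0002"},                     # missing abstract: skipped
    {"id": "CN-0003", "abstract": "   "},  # blank abstract: skipped
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "patents.json"
    sample_path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")
    loaded = load_abstracts(sample_path)

print(loaded)
```

Whether silently skipping bad records is the right choice depends on your data; for a first pass over a new dataset, failing loudly (as load_raw_data does) may be more informative.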

modules/preprocess.py

import jieba


def clean_and_segment(texts, stopword_path):
    # Load one stopword per line; the with-block closes the file afterwards
    with open(stopword_path, encoding='utf-8') as f:
        stopwords = set(f.read().splitlines())
    result = []
    for text in texts:
        words = jieba.lcut(text)
        # Drop stopwords and single-character tokens
        filtered = [w for w in words if w not in stopwords and len(w) > 1]
        result.append(filtered)
    return result
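The project layout reserves data/processed/ for tokenized output, but clean_and_segment only returns its result in memory. One possible way to persist that result, so LDA training can be rerun without re-tokenizing, is a pair of small JSON helpers. The names save_tokens and load_tokens and the file name tokens.json are assumptions made for this sketch.

```python
import json
import tempfile
from pathlib import Path


def save_tokens(tokenized_texts, path):
    """Write a list of token lists to a UTF-8 JSON file, creating parent dirs."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese tokens readable in the file
        json.dump(tokenized_texts, f, ensure_ascii=False)


def load_tokens(path):
    """Read back the token lists written by save_tokens."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Made-up tokens; a temporary directory stands in for the project root
tokens = [["锂电池", "正极", "材料"], ["图像", "识别", "方法"]]
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "processed" / "tokens.json"
    save_tokens(tokens, out_path)
    restored = load_tokens(out_path)

print(restored == tokens)  # True
```

In the real project you would pass a path such as data/processed/tokens.json instead of the temporary directory.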

modules/lda_model.py

from gensim import corpora, models


def train(texts, num_topics=10):
    # Map each token to an integer id, then build bag-of-words vectors
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)
    return lda_model

modules/visualize.py

from wordcloud import WordCloud
import matplotlib.pyplot as plt


def show_wordclouds(model):
    for i in range(model.num_topics):
        # show_topic returns (word, weight) pairs for the top words of topic i
        words = dict(model.show_topic(i, topn=30))
        # font_path must point to a font with Chinese glyphs available on your machine
        wc = WordCloud(font_path='SimHei.ttf', width=800, height=400).generate_from_frequencies(words)
        plt.imshow(wc, interpolation='bilinear')
        plt.title(f"Topic {i}")
        plt.axis("off")
        plt.show()

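🧪 5. Debugging tips (PyCharm)

•   Add print() logging or set breakpoints inside each @task function.
•   Prefect runs tasks as native Python functions, so line-by-line debugging works, which fits research workflows well.
•   You can also run the pipeline from the PyCharm terminal: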

python pipeline.py
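As an alternative to scattering print() calls through every stage, a small decorator can log when a stage starts and how large its result is. This is a minimal sketch using only the standard library logging module; log_stage and the stand-in stage drop_short_tokens are hypothetical names, and the decorator is shown on a plain function rather than on a Prefect task.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("lda_pipeline")


def log_stage(func):
    """Log the start of a stage and the size of its result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("start %s", func.__name__)
        result = func(*args, **kwargs)
        size = len(result) if hasattr(result, "__len__") else "n/a"
        logger.info("done %s (result size: %s)", func.__name__, size)
        return result
    return wrapper


@log_stage
def drop_short_tokens(docs):
    # Stand-in stage: the same length filter as clean_and_segment, no jieba needed
    return [[w for w in doc if len(w) > 1] for doc in docs]


cleaned = drop_short_tokens([["锂电池", "的", "材料"], ["一", "种"]])
print(cleaned)
```

Running a stage on a tiny hand-made input like this is also a quick way to check its logic before launching the full flow.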

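If you'd like, I can generate a complete project template archive or set up a Git project structure for you. Should I prepare a zip package or a scaffold initialization script?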