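How to Automatically Collect and Save Travel News (RSS)

- Practical automation that makes blog content prep easier -

When you keep collecting and organizing travel news by hand, some days you only save the article links, and other days you move on without organizing anything at all.

So I automated the whole process so that each day's news gets collected for me.

Today I'll walk you through that automation.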

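1. What gets handled automatically?

The single script below takes care of all of the following:

- Fetch the list of new articles from a news site's RSS address
- Extract each article's title, link, date, and summary
- Clean up summaries that have HTML tags mixed in
- Convert dates to an **easy-to-read format (YYYY-MM-DD HH:MM:SS)**
- Remove duplicate articles (articles with the same link)
- Save the results to CSV, JSON, and XML in one go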
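
The cleanup steps in this list can be sketched with the standard library alone. This is a minimal illustration, not the script used later in this post: the helper names (`strip_html`, `to_iso_date`, `dedupe_by_link`) are made up for the example, and it assumes dates arrive in the RFC 822 style that RSS `pubDate` fields normally use.

```python
import html
import re
from email.utils import parsedate_to_datetime


def strip_html(raw):
    """Decode HTML entities, drop tags, and collapse whitespace."""
    text = html.unescape(raw or "")
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def to_iso_date(pub_date):
    """Convert an RFC 822 date string to 'YYYY-MM-DD HH:MM:SS'."""
    return parsedate_to_datetime(pub_date).strftime("%Y-%m-%d %H:%M:%S")


def dedupe_by_link(articles):
    """Keep only the first article seen for each link."""
    seen = set()
    unique = []
    for article in articles:
        if article["link"] not in seen:
            seen.add(article["link"])
            unique.append(article)
    return unique
```

Decoding entities before stripping tags also removes markup that arrived escaped inside the summary, which is common in RSS feeds. The trade-off is that a literal "<" followed later by ">" in the text would be treated as a tag.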

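2. As an example, I collected Korean travel news

The RSS address I used this time is the Korea Tourism Organization news RSS:

<https://korean.visitkorea.or.kr/rss/news/news.xml>

It regularly carries domestic tourism news, festival news, cultural news, and other articles you can use right away on a travel blog.

Just put this address into the code, and new articles are automatically organized and saved to files.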

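3. This is how the files are saved

The generated files are saved with the date in their names, like this:

korea_travel_20250604.xml
korea_travel_20250604.csv
korea_travel_20250604.json

Because the file names are based on when the script runs, you get a fresh set of files each day, which makes sorting by date and backing up very convenient.

Note that the script as listed in section 6 uses the prefix kto_press and a timestamp that also includes the time (kto_press_YYYYMMDD_HHMMSS), so change the prefix there if you want names like the ones above.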
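
The date-stamped naming can be sketched as follows. `dated_paths` is a hypothetical helper written for this illustration and is not part of the full script; the folder name `rss_data` and the prefix `korea_travel` are just the examples used in this post.

```python
from datetime import date
from pathlib import Path


def dated_paths(save_dir, prefix, day):
    """Return one output path per format, stamped with the given date."""
    stamp = day.strftime("%Y%m%d")
    return {
        ext: Path(save_dir) / f"{prefix}_{stamp}.{ext}"
        for ext in ("xml", "csv", "json")
    }


paths = dated_paths("rss_data", "korea_travel", date(2025, 6, 4))
print(paths["csv"].name)  # korea_travel_20250604.csv
```

Passing `date.today()` instead of a fixed date gives the name for the day the script runs.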

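4. You can choose where the files are saved

save_dir = "C:/Users/YourName/Desktop/rss_data"  # <- change only this part

Change just this one line to match your own computer, and the collected news is saved automatically to the folder you want.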
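
If you would rather not hard-code a user name in the path, one option is to build it from the home directory. This is a minimal sketch under that assumption; `ensure_save_dir` is a hypothetical helper and is not part of the original script.

```python
from pathlib import Path


def ensure_save_dir(base, folder="rss_data"):
    """Create base/folder if it does not exist yet and return it as a Path."""
    target = Path(base) / folder
    target.mkdir(parents=True, exist_ok=True)
    return target


# Example call (Path.home() is the current user's home directory on any OS):
# save_dir = ensure_save_dir(Path.home() / "Desktop")
```

Because `mkdir` is called with `exist_ok=True`, running this every day is safe even when the folder is already there.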

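5. Recommended uses for travel bloggers

You can use the automatically collected news in several ways:

Blog post drafts

→ Post just the titles, links, and summaries, and you have an easy series such as "This Week's Tourism News."

Curated travel-news card posts

→ Pull out just the summary text and rework it for Instagram or YouTube Shorts.

Data-driven analysis

→ Use the saved CSV file to see which keywords appear most often or to analyze news topics by season.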
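
As a rough idea of what the keyword analysis could look like, here is a minimal sketch that counts words in the `title` column of a CSV. `top_keywords` is a hypothetical helper, the sample rows are made up, and splitting on word boundaries is a simplifying assumption: Korean text would usually need a morphological analyzer to get meaningful keywords.

```python
import csv
import io
import re
from collections import Counter


def top_keywords(csv_file, column="title", n=5, min_len=2):
    """Count the most frequent words in one column of an open CSV file."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        words = re.findall(r"\w+", (row.get(column) or "").lower())
        counts.update(word for word in words if len(word) >= min_len)
    return counts.most_common(n)


# Made-up sample data standing in for a saved CSV file
sample = io.StringIO(
    "title,link\n"
    "Spring festival opens in Seoul,https://example.com/1\n"
    "Jeju festival schedule announced,https://example.com/2\n"
)
print(top_keywords(sample, n=1))  # [('festival', 2)]
```

For a real file, open it with `open(path, encoding="utf-8-sig", newline="")`, which matches the encoding the script uses when it writes the CSV.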

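6. Setup is simple

Install the required libraries with the command below and you can run the script right away. lxml is needed for the "xml" parser that the script asks BeautifulSoup to use, and python-dateutil provides the date parsing:

pip install requests beautifulsoup4 pandas lxml python-dateutil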
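
Before running the full script, a quick check like the one below can confirm that everything is importable. This is only a sketch: `missing_modules` is a hypothetical helper, and the list assumes the script needs lxml (for BeautifulSoup's "xml" parser) and dateutil (imported by the script) in addition to the three packages named in the pip command.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names differ from pip names: beautifulsoup4 -> bs4, python-dateutil -> dateutil
required = ["requests", "bs4", "pandas", "lxml", "dateutil"]
print("Missing:", missing_modules(required) or "none")
```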

import html
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd
import xml.etree.ElementTree as ET
from datetime import datetime
import json
import re
from urllib.parse import urljoin
from dateutil import parser

def clean_html(raw_html):
    """Strip HTML tags from a description and tidy up special characters."""
    if not raw_html:
        return ""

    # Remove HTML tags (including tags that span several lines)
    clean_text = re.sub(r'<[^>]+>', '', raw_html)
    # Decode HTML entities such as &amp;, &quot;, and &nbsp;
    clean_text = html.unescape(clean_text)
    # Collapse runs of whitespace
    clean_text = re.sub(r'\s+', ' ', clean_text)
    return clean_text.strip()

def format_date(date_str):
    """Parse flexibly with dateutil.parser and return 'YYYY-MM-DD HH:MM:SS'."""
    if not date_str:
        # No date in the feed: fall back to the collection time
        return datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    try:
        dt = parser.parse(date_str)
        return dt.strftime("%Y-%m-%d %H:%M:%S")
    except Exception as e:
        print(f"⚠️ Failed to parse date: {date_str} ({e})")
        return date_str

def validate_url(url):
    """Check that the URL is an absolute http(s) address."""
    return bool(url and url.startswith(("http://", "https://")))

def main():
    try:
        # Output directory: replace the placeholder ("enter your own save path") with a real path
        save_dir = r"[자신의 저장 경로를 적으세요!]"
        try:
            os.makedirs(save_dir, exist_ok=True)
            print(f"📁 Save path: {save_dir}")
        except Exception as dir_error:
            print(f"❌ Could not create directory: {dir_error}")
            return

        today_str = datetime.now().strftime("%Y%m%d_%H%M%S")
        # Feed to collect. Note: this is not the visitkorea.or.kr address shown in
        # section 2, so replace it with the RSS address you want.
        rss_url = "https://www.korea.kr/rss/dept_mcst.xml"

        print("📡 Requesting RSS feed...")
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/91.0.4472.124 Safari/537.36'
        }

        response = requests.get(rss_url, timeout=30, headers=headers)
        response.raise_for_status()

        content_type = response.headers.get("Content-Type", "").lower()
        if "xml" not in content_type and "text" not in content_type:
            print(f"⚠️ Unexpected Content-Type: {content_type}")
            print("Start of server response:")
            print(response.text[:500])
            # Keep going and try to parse even if it is not XML

        try:
            # The "xml" parser requires the lxml package
            soup = BeautifulSoup(response.content, "xml")
        except Exception as parse_error:
            print(f"❌ XML parsing error: {parse_error}")
            print("Retrying with the HTML parser...")
            # Caveat: html.parser lowercases tag names and treats <link> as an empty
            # tag, so pubDate and link may come back blank; installing lxml is the fix.
            soup = BeautifulSoup(response.content, "html.parser")

        items = soup.find_all("item")
        print("▶ Number of items:", len(items))
        print("▶ Preview of items[:2]:", items[:2])
        if not items:
            print("⚠️ No <item> tags were found in the RSS feed.")
            print(response.text[:1000])
            return

        print(f"📰 Found {len(items)} articles in total.")
        
        articles = []
        seen_links = set()

        for idx, item in enumerate(items, 1):
            try:
                # ─────── Title ───────
                title_tag = item.find("title")
                title = title_tag.get_text().strip() if title_tag else f"untitled_{idx}"

                # ─────── Link (fall back to <guid> when <link> is empty) ───────
                link_tag = item.find("link")
                link_raw = link_tag.get_text().strip() if link_tag else ""
                link = urljoin(rss_url, link_raw) if link_raw else ""
                if not link:
                    guid_tag = item.find("guid")
                    guid_raw = guid_tag.get_text().strip() if guid_tag else ""
                    link = urljoin(rss_url, guid_raw) if guid_raw else ""

                if not link:
                    print(f"⚠️ Article {idx}: skipped because it has no link")
                    continue
                if not validate_url(link):
                    print(f"⚠️ Article {idx}: invalid URL - {link}")
                    continue
                if link in seen_links:
                    print(f"⚠️ Article {idx}: skipped duplicate link")
                    continue
                seen_links.add(link)

                # ─────── Publication date ───────
                pub_date_tag = item.find("pubDate")
                pub_date = pub_date_tag.get_text().strip() if pub_date_tag else ""
                pub_date_iso = format_date(pub_date)

                # ─────── Summary (description) ───────
                desc_tag = item.find("description")
                desc_raw = desc_tag.get_text() if desc_tag else ""
                description = clean_html(desc_raw)

                articles.append({
                    "id": len(articles) + 1,
                    "title": title,
                    "link": link,
                    "date": pub_date_iso,
                    "summary": description,
                    "raw_date": pub_date or ""
                })
                print(f"✅ Article {len(articles)}: {title[:50]}...")

            except Exception as item_error:
                print(f"❌ Error while processing article {idx}: {item_error}")
                continue

        if not articles:
            print("⚠️ No articles could be processed.")
            return
        
        print(f"\n📊 Data collection finished: {len(articles)} articles")

        # ─────── Build the XML tree ───────
        print("🔧 Building XML structure...")
        xml_root = ET.Element("kto_press")
        xml_root.set("generated", datetime.now().isoformat())
        xml_root.set("source", rss_url)
        xml_root.set("total", str(len(articles)))

        for art in articles:
            try:
                article_elem = ET.SubElement(xml_root, "article", id=str(art["id"]))
                for tag_name, value in [
                    ("title", art["title"]),
                    ("link", art["link"]),
                    ("pubDate", art["date"]),
                    ("description", art["summary"])
                ]:
                    child = ET.SubElement(article_elem, tag_name)
                    child.text = value or ""
            except Exception as xml_err:
                print(f"⚠️ XML build error (article {art['id']}): {xml_err}")
                continue

        df = pd.DataFrame(articles)
        xml_path = os.path.join(save_dir, f"kto_press_{today_str}.xml")
        csv_path = os.path.join(save_dir, f"kto_press_{today_str}.csv")
        json_path = os.path.join(save_dir, f"kto_press_{today_str}.json")

        print("\n💾 Saving files...")
        try:
            tree = ET.ElementTree(xml_root)
            tree.write(xml_path, encoding="utf-8", xml_declaration=True)
            print(f"✅ XML saved: {xml_path}")
        except Exception as e:
            print(f"❌ Failed to save XML: {e}")

        try:
            # utf-8-sig adds a BOM so that Excel displays Korean text correctly
            df.to_csv(csv_path, index=False, encoding="utf-8-sig")
            print(f"✅ CSV saved: {csv_path}")
        except Exception as e:
            print(f"❌ Failed to save CSV: {e}")

        try:
            with open(json_path, "w", encoding="utf-8") as f:
                json.dump({
                    "generated_at": datetime.now().isoformat(),
                    "source_url": rss_url,
                    "total_articles": len(articles),
                    "articles": articles
                }, f, ensure_ascii=False, indent=2)
            print(f"✅ JSON saved: {json_path}")
        except Exception as e:
            print(f"❌ Failed to save JSON: {e}")

        print("✅ Saving finished!")
        print(f"\n📊 Result: {len(articles)} articles in total  /  collected at: {today_str}")
        print(f"📁 Saved to: {save_dir}")
        for ftype, fpath in [("XML", xml_path), ("CSV", csv_path), ("JSON", json_path)]:
            if os.path.exists(fpath):
                size = os.path.getsize(fpath)
                print(f"  ✅ {ftype}: {os.path.basename(fpath)} ({size:,} bytes)")
            else:
                print(f"  ❌ {ftype}: file was not created")

        if articles:
            print("\n📰 Sample of collected articles:")
            for i, art in enumerate(articles[:3], 1):
                print(f"  {i}. {art['title'][:60]}…  📅 {art['date']}")
    
    except requests.exceptions.Timeout:
        print("❌ Request timed out: the server is responding too slowly.")
    except requests.exceptions.ConnectionError:
        print("❌ Connection error: please check your network connection.")
    except requests.exceptions.HTTPError as e:
        # A Response is falsy for 4xx/5xx status codes, so compare against None explicitly
        status = e.response.status_code if e.response is not None else "Unknown"
        print(f"❌ HTTP error: {e} (status code: {status})")
    except requests.exceptions.RequestException as e:
        print(f"❌ Error during network request: {e}")
    except PermissionError:
        print(f"❌ File write permission error: no write permission for {save_dir}.")
    except Exception as e:
        print(f"❌ Unknown error: {e}")
        import traceback
        traceback.print_exc()

if __name__ == "__main__":
    print("🇰🇷 Korea Tourism Organization press release RSS collector")
    print("=" * 50)
    main()
    print("=" * 50)
    print("Program finished.")

Find the line save_dir = r"[자신의 저장 경로를 적으세요!]" (the placeholder means "enter your own save path"), put in your own save path, and the script is ready to use!

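# If you run into an error, check the one I ran into as well.

"BeautifulSoup's Tag has no findtext() method, so looking it up returned None, and calling that as None(…) raised the error."

- Changing the call to find("…").get_text() or .text made the script run normally.

In short, the core problem was that using the wrong method (findtext) caused a "'NoneType' object is not callable" error.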
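
For context, findtext() is a real method, but it belongs to the standard library's xml.etree.ElementTree, not to BeautifulSoup. The sketch below shows it on a made-up RSS item. It only illustrates where that method does exist and is not a replacement for the script above; `SAMPLE_ITEM` and its contents are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up RSS item used only for this illustration
SAMPLE_ITEM = """
<item>
  <title>Sample headline</title>
  <link>https://example.com/news/1</link>
</item>
""".strip()

item = ET.fromstring(SAMPLE_ITEM)

# ElementTree elements provide findtext(); the default covers missing tags
title = item.findtext("title", default="")
summary = item.findtext("description", default="")

print(repr(title))    # 'Sample headline'
print(repr(summary))  # ''
```

With BeautifulSoup, the equivalent is the pattern the script uses: call find() first, then get_text() on the result if it is not None.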

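Wrapping up

Collecting and organizing news by hand takes more time than you'd expect.

Once you automate it like this, each day's travel news gets organized for you, and running your blog becomes much more efficient.

In the next post, I'll share how to summarize this news data and automatically turn it into short blog posts.

If you're interested, let me know in the comments.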

