Backing Up Tistory Posts and Images with Python

urjent — Wed, 27 Sep 2023 13:37:43 +0000

If you need a backup of your Tistory blog, you may want to use Python (scraping) to save your posts and images to your PC.

This post walks through how to back up Tistory with Python.

First, here is the environment the code was written for:

- The blog uses the Book Club skin and has 7 to 8 categories of posts.

- Post addresses are set to the numeric format.

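After inspecting the HTML in the browser developer tools (F12):

- The scraping is done with requests and BeautifulSoup.

- PIL Image is used to solve the problem of image URLs that do not show a file extension at download time.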

Some images had a src that ended in a proper extension (.jpg, .png), but others had URLs in forms like the following.

"https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=http%3A%2F%2Fcfile1.uf.tistory.com%2Fimage%2F993693465F20BB0F1FAFB6" src="https://t1.daumcdn.net/cfile/tistory/993693465F20BB0F1F" 

//i1.daumcdn.net/thumb/C176x120/?fname=https://t1.daumcdn.net/cfile/tistory/99400A3F5F21057413
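Both thumbnail forms above carry the original image address in an fname query parameter. The script in this post does not unwrap it (it simply skips src values that start with '//'), but a minimal sketch of doing so, assuming the fname parameter is always the original image URL, could look like this. resolve_img_src is a hypothetical helper name, not part of the original code.

```python
from urllib.parse import parse_qs, urlparse


def resolve_img_src(src):
    """Return the URL held in the fname parameter, or the src itself if there is none."""
    # Protocol-relative URLs ('//host/path') need a scheme before they can be requested
    if src.startswith('//'):
        src = 'https:' + src
    # parse_qs also percent-decodes values such as http%3A%2F%2F...
    fname = parse_qs(urlparse(src).query).get('fname')
    return fname[0] if fname else src
```

Passing the result to requests.get instead of the raw src would also bring the '//' images into the backup.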

Inspecting such an image with PIL Image gives:

img_url: https://t1.daumcdn.net/cfile/tistory/992895395F2040A804
img_format: PNG
imge_size: (830, 1019)
len(image): 41568
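The format reported above comes from the file contents, not from the URL. As a rough illustration of that idea without PIL (this is a swapped-in technique, not the method used in this post), the leading signature bytes of the download can be checked by hand. sniff_image_format is a hypothetical helper and only covers four common formats.

```python
def sniff_image_format(data):
    """Guess an image format from its leading bytes; return None if unknown."""
    if data.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'PNG'
    if data.startswith(b'\xff\xd8\xff'):
        return 'JPEG'
    if data.startswith((b'GIF87a', b'GIF89a')):
        return 'GIF'
    # WebP is a RIFF container with 'WEBP' at byte offset 8
    if data[:4] == b'RIFF' and data[8:12] == b'WEBP':
        return 'WEBP'
    return None
```

PIL remains the more complete choice, since it also reports the image size and recognizes many more formats.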

The source code:

from io import BytesIO
import os

from bs4 import BeautifulSoup
from PIL import Image
import requests


def tistory_backup(post_num):
    """Back up posts 1..post_num (text and images) under ./tistoryBackup."""
    for num in range(1, post_num + 1):
        # Replace the placeholder with your own blog address, e.g. abc4u.tistory.com
        url = 'https://<your Tistory URL>/' + str(num)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')

        ### Post title
        titles = soup.select_one('#content > div.inner > div.post-cover > div > h1')

        ### Publication date
        date = soup.select_one('#content > div.inner > div.post-cover > div > span.meta > span.date')

        ### Post body
        entry_content = soup.find('div', {'class': 'entry-content'})

        # Skip post numbers where the expected elements are missing (e.g. deleted posts)
        if not titles or not date or not entry_content:
            continue

        print(titles.text)
        print(date.text)
        print(entry_content.get_text())

        # Reuse the page parsed above instead of requesting it a second time.
        # Selects src values starting with https; src values starting with '//' are excluded.
        imgs = soup.select('img[src^=https]')
        print(f'Number of images: {len(imgs)}')

        # Create the save directory (makedirs also creates the parent folder)
        post_dir = os.path.join('tistoryBackup', 'post_' + str(num))
        os.makedirs(post_dir, exist_ok=True)

        cnt = 1
        for img in imgs:
            img_url = img['src']

            # Download each image once; requests' .content returns bytes
            res_img = requests.get(img_url).content

            ## Use pillow's Image to find out the image format
            try:
                imageObj = Image.open(BytesIO(res_img))
            except OSError:  # the response is not an image PIL can identify
                continue
            img_format = imageObj.format
            imge_size = imageObj.size
            print(f'img_url: {img_url}')
            print(f'img_format: {img_format}')
            print(f'imge_size: {imge_size}')
            print(f'os.path.basename(img_url): {os.path.basename(img_url)}')
            print(f'len(image): {len(res_img)}')

            if img_url.split('.')[-1] in ['png', 'jpg']:
                img_name = str(num) + '_' + str(cnt) + '_' + os.path.basename(img_url)
            else:
                # No usable extension in the URL, so build one from the detected format
                img_name = str(num) + '_' + str(cnt) + '_' + 'no_filename_img.' + img_format.lower()

            print(img_name)

            if len(res_img) > 100:  # save only images larger than 100 bytes
                with open(os.path.join(post_dir, img_name), 'wb') as f:
                    f.write(res_img)
                cnt += 1

        title_content = titles.text + '\n' + date.text + '\n' + entry_content.get_text()
        filename = str(num) + '_tistory_title_content.txt'
        with open(os.path.join(post_dir, filename), 'w', encoding='utf-8') as f:
            f.write(title_content)


tistory_backup(20)

In the call tistory_backup(20), 20 is the number in the post address.
That is, the posts from https://abc4u.tistory.com/1 through https://abc4u.tistory.com/20 are scraped. Pass the number of your most recent post to extract everything from post 1 up to the latest one.
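To make the range concrete, here is a small sketch of the addresses that a given post number covers. post_urls is a hypothetical helper for illustration; the backup function builds the same addresses inline.

```python
def post_urls(blog_url, last_post_num):
    """List the numeric post addresses from 1 through last_post_num, inclusive."""
    base = blog_url.rstrip('/')
    return [f'{base}/{num}' for num in range(1, last_post_num + 1)]


urls = post_urls('https://abc4u.tistory.com', 20)
print(urls[0])   # first address in the range
print(urls[-1])  # last address in the range
```

Numbers that do not correspond to a post are simply skipped by the backup function, so gaps in the numbering are not a problem.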

[Image: Tistory backup]

Running the source code creates folders as shown in the Explorer image above, saves each post's text as a .txt file, and saves every image in the post under a newly generated name.
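The naming rule for the saved images can be sketched as a standalone function. This is a variant under one stated assumption, not the exact code above: it takes the extension from the URL path only, so a query string after the file name does not end up in the saved name. build_img_name is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlparse


def build_img_name(num, cnt, img_url, img_format):
    """Build '<post>_<index>_<name>' for an image, falling back to the detected format."""
    base = posixpath.basename(urlparse(img_url).path)
    ext = posixpath.splitext(base)[1].lower()
    if ext in ('.png', '.jpg'):
        return f'{num}_{cnt}_{base}'
    # No usable extension in the URL: use the format reported by PIL, lowercased
    return f'{num}_{cnt}_no_filename_img.{img_format.lower()}'
```

For post 3, a URL ending in pic.jpg would be saved as 3_1_pic.jpg, while an extensionless daumcdn URL detected as PNG would be saved as 3_2_no_filename_img.png, both under tistoryBackup/post_3.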
